Twin Neural Network Regression is a Semi-Supervised Regression Algorithm

Twin neural network regression (TNNR) is a semi-supervised regression algorithm: it can be trained on unlabelled data points as long as other, labelled anchor data points are present. TNNR is trained to predict differences between the target values of two different data points rather than the targets themselves. By ensembling the predicted differences between the targets of an unseen data point and all training data points, it is possible to obtain a very accurate prediction for the original regression problem. Since any loop of predicted differences should sum to zero, loops can be supplied to the training data even if the data points within a loop are unlabelled. Semi-supervised training significantly improves the performance of TNNR, which is already state of the art.


Introduction
Regression is the process of estimating the relationship between feature variables and outcome variables. It is one of the most central machine learning tasks in a wide range of scientific, analytic and industrial research and development disciplines. Regression analysis is considered a supervised machine learning task, where one is given labelled training data from which a model is trained to make predictions of the labels of unlabelled data points. In many applications, little labelled training data is available, while unlabelled data is much easier to obtain. In the context of semi-supervised learning, unlabelled data can be leveraged in addition to labelled data to train machine learning models [23,6]. There are two cases of semi-supervised learning, which differ in whether the data for which one wants to obtain predictions is available during the training phase. The goal of inductive semi-supervised learning is to find the function that maps a feature variable to its outcome variable from a combined set of labelled and unlabelled data. This function can then be used to make predictions on new data points that are not available during the training phase. The goal of transductive semi-supervised learning is to infer the correct labels for the given unlabelled data, which is already present during the training phase.
Most existing semi-supervised learning algorithms focus on classification problems and are primarily based on continuity and cluster assumptions. Continuity assumes that neighboring data points likely share the same label, while clustering of the data makes it possible to draw decision boundaries in low-density regions [23,6]. These assumptions cannot be leveraged in semi-supervised regression to the same extent as in semi-supervised classification, which is why it is more difficult to develop semi-supervised regression methods. This obstacle explains the scarcity of publications on semi-supervised regression and why research in this area has not achieved the same level of success as semi-supervised classification.
Semi-supervised regression was already an active research area before neural networks became popular in 2012 [13]. However, only very few research articles have since included neural networks as part of their proposed semi-supervised regression pipelines, mostly in combination with other machine learning algorithms [11,14]. While neural networks are not always the optimal choice, their performance ceiling is much higher due to the universal approximation theorem [8,10], the incorporation of feature extraction into the learning problem, and efficient training through gradient descent and backpropagation.
In this article we develop a semi-supervised training procedure for twin neural network regression (TNNR) [18]. TNNR is based on an architecture similar to Siamese neural networks [5,2] in the sense that it takes two inputs. TNNR is trained to predict the differences between the labels of the input pair. The solution of the original regression problem is then obtained by creating an ensemble of all predicted differences between a new data point and all labelled training data points, plus their labels. TNNR is normally trained on pairs; here we explain how TNNR can be trained on triples of data points which may be unlabelled. These triples form a loop along which predictions must sum to zero. This constraint can be implemented through a suitable loss function to transform TNNR into a semi-supervised regression algorithm. The strengths of TNNR can be summarized as follows:
1. TNNR attempts to circumvent the bias-variance tradeoff by internal ensembling, which translates to a smaller expected error [18].
2. TNNR provides prediction uncertainty estimates through the variance of its internal ensemble of predictions [18].
3. TNNR can be trained in a semi-supervised manner on loops containing unlabelled data points [this work].

Prior Work
Twin neural networks are inspired by Siamese neural networks, which were introduced to solve infinite-class classification problems as they occur in fingerprint recognition or signature verification [5,2]. Siamese neural networks consist of two identical neural networks which project an input pair into a latent space; the similarity of the two inputs is determined by their distance in latent space. The twin neural network in TNNR also takes a pair of inputs, but it is trained to predict the difference between their labels [18]. These networks differ from Siamese networks in two respects: first, there is no (or only minimal) weight sharing between neurons acting on corresponding parts of the input pair; second, they do not project into a latent space but are fully connected. TNNR predictions can be transformed into a solution of the original regression problem by averaging over all predicted differences between an unlabelled data point and all labelled training data points, plus their labels. TNNR attempts to circumvent the bias-variance tradeoff, since the bias of neural networks is low given sufficiently many parameters and the variance can be reduced by the implicit ensemble of different predictions. TNNR also provides prediction uncertainty estimates through the variance of the predictions.
Most previous research in semi-supervised regression can be classified into three categories: co-training, kernel and graph based regression methods [12]. Co-training denotes alternating training between two different regressors (the original idea is to use different feature sets for each regressor), where the predictions of one help create a training set for the other; this process is repeated until convergence [4]. While initially invented for semi-supervised classification, there has been progress in adapting co-training to semi-supervised regression [20,17,9]. Semi-supervised support vector machines [3,7] have been extended to semi-supervised regression [19]. In graph based semi-supervised regression methods, labelled and unlabelled data are considered nodes on a graph. Similar nodes are connected by edges along which information propagates [21,22,15].
Figure 1: TNNR formulation of a regression problem: In order to solve a traditional regression problem, a neural network is trained to map a data point $x$ to its target value $f(x) = y$. A TNN takes two inputs $x_1$ and $x_2$ and is trained to predict the difference between the target values, $F(x_2, x_1) = y_2 - y_1$. This difference can be employed as an estimator $y_2 = F(x_2, x_1) + y_1$ given a labelled anchor point $(x_1, y_1)$. In contrast to a traditional neural network $f$, a TNN $F$ must satisfy loop consistency: predictions along each loop sum to zero, $F(x_1, x_2) + F(x_2, x_3) + F(x_3, x_1) = 0$.

Twin Neural Network Regression

Reformulation of the Regression Problem
We are given a training set of $n$ labelled data points $X^{\mathrm{train}} = (x_1^{\mathrm{train}}, \ldots, x_n^{\mathrm{train}})$ with labels $Y^{\mathrm{train}} = (y_1^{\mathrm{train}}, \ldots, y_n^{\mathrm{train}})$. The goal is to find a function $f$ such that $f(x_i) = y_i$, which generalizes to unseen data $X^{\mathrm{test}}$ to make an accurate prediction for the labels $Y^{\mathrm{test}}$. TNNR aims to solve a reformulation of the original regression problem, see fig. 1. Given a pair of data points $(x_i, x_j)$, we train a neural network to find a function $F$ that predicts the difference
$$F(x_i, x_j) = y_i - y_j . \quad (1)$$
The TNN $F$ provides a solution to the original regression problem of predicting $y_i$ by evaluating $y_i = F(x_i, x_j) + y_j$, where $(x_j, y_j)$ is an anchor whose label is known. Every training data point $x_j^{\mathrm{train}} \in X^{\mathrm{train}}$ can be employed as anchor. Since ensembles of predictions are more accurate than single predictions, the best estimate for the solution of the original regression problem is obtained by averaging
$$y_i^{\mathrm{pred}} = \frac{1}{2n} \sum_{j=1}^{n} \left[ \left( F(x_i, x_j^{\mathrm{train}}) + y_j^{\mathrm{train}} \right) + \left( y_j^{\mathrm{train}} - F(x_j^{\mathrm{train}}, x_i) \right) \right] . \quad (2)$$
This ensemble contains twice the training set size of predictions of differences $y_i - y_j$ for a single prediction of $y_i$.
Since training data points are not just separated by infinitesimal perturbations, the created ensemble is more diverse than a pseudo-ensemble generated through infinitesimal perturbations of model weights [1]. An ensemble of different TNN models, each containing an implicit ensemble of predictions itself, is even more accurate than the implicit ensemble from a single TNN [18].
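To make the ensembling step concrete, the following sketch shows how a trained difference predictor could be turned into a prediction for the original regression problem by averaging over all anchors in both pair orderings, as in eq. 2. This is a minimal illustration, not the authors' implementation; the callable `F` (a trained twin network acting on batches of input pairs) and the array shapes are assumptions.

```python
import numpy as np

def tnnr_predict(F, x_new, X_train, y_train):
    """Ensemble prediction for a single query point x_new (eq. 2).

    F(a, b) is assumed to be a trained twin network that returns an
    estimate of y_a - y_b for a batch of input pairs. Every one of the
    n training points serves as an anchor twice (once in each pair
    ordering), yielding the 2n-member implicit ensemble from the text.
    """
    n = len(X_train)
    x_rep = np.repeat(x_new[None, :], n, axis=0)   # (n, d): x_new paired with every anchor
    forward = F(x_rep, X_train) + y_train          # F(x_new, x_j) + y_j
    backward = y_train - F(X_train, x_rep)         # y_j - F(x_j, x_new)
    return np.mean(np.concatenate([forward, backward]))
```

The standard deviation of the same 2n estimates would give the prediction uncertainty estimate mentioned in the text.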

Semi-Supervised Learning on Loops
If, in addition to a labelled training data set, an unlabelled data set is available during the training phase, TNNR can be used as a semi-supervised regression method. TNNR can be trained on unlabelled data points which are connected along loops, see fig. 2. Loops consist of several data points which can be labelled or unlabelled; the corresponding loop labels express the differences between the target labels of the original data points. For simplicity, we restrict ourselves to loops of size three. Since all higher order loops can be composed of loops of size three, we do not believe they would lead to improved performance. In our case, a loop can be expressed as a triple $(x_i, x_j, x_k)$ with data labels $(y_i, y_j, y_k)$ and corresponding loop labels $(y_i - y_j, y_j - y_k, y_k - y_i)$; if a label of a data point is unknown during the training phase, the corresponding loop labels are left blank. As long as the labels are known, each pair of original data points $(x_i, x_j)$ drawn from the triple can be used to train the TNN to predict the difference $y_i - y_j$. Furthermore, the predictions along each loop containing $(x_i, x_j, x_k)$ should satisfy the loop consistency condition
$$F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) = 0 . \quad (3)$$
This condition can be used to train the TNN in an unsupervised manner, since it does not require the presence of any labels. As shown in fig. 2, we consider four types of loops. We only use loops where all $x_i$ have a label $y_i$ as supervised training data (loop A). If at least one of the $x_i$ is unlabelled (loops B, C, D), we use the loop as unsupervised training data. TNNR cannot be used as a purely unsupervised regression algorithm; some labelled data points must be contained in the training set.

Figure 2: A triple of data points $(x_1, x_2, x_3)$ translates to a TNNR training loop with labels $(y_1 - y_2, y_2 - y_3, y_3 - y_1)$, depicted as loop A. Loop A is used as supervised training data for TNNR with the objective to minimize the MSE between the TNN outputs and each label of loop A. In loops B, C and D at least one data point is unlabelled. On these loops, TNNR is trained in an unsupervised manner to enforce loop consistency $F(x_1, x_2) + F(x_2, x_3) + F(x_3, x_1) = 0$.
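A possible way to assemble the training loops described above is sketched below: it simply samples triples from the pooled labelled and unlabelled data and attaches loop labels only when all three points are labelled. The sampling strategy and function names are illustrative assumptions, not the authors' data pipeline.

```python
import numpy as np

def sample_loops(X_lab, y_lab, X_unlab, n_loops, seed=0):
    """Sample triples (x_i, x_j, x_k) with loop labels (y_i-y_j, y_j-y_k, y_k-y_i).

    Loops containing at least one unlabelled point (types B, C, D) get
    loop_labels = None and are only used for the loop-consistency objective.
    """
    rng = np.random.default_rng(seed)
    X_all = np.concatenate([X_lab, X_unlab])
    n_lab = len(X_lab)
    loops = []
    for _ in range(n_loops):
        i, j, k = rng.choice(len(X_all), size=3, replace=False)
        triple = (X_all[i], X_all[j], X_all[k])
        if max(i, j, k) < n_lab:                      # loop A: fully labelled
            labels = (y_lab[i] - y_lab[j],
                      y_lab[j] - y_lab[k],
                      y_lab[k] - y_lab[i])
        else:                                         # loops B, C, D
            labels = None
        loops.append((triple, labels))
    return loops
```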

Training Architecture
The architecture used to train the TNN is shown in fig. 3. Three copies of the main TNN are trained simultaneously; during training these networks share their weights. The input of the full architecture are loops along triples $(x_i, x_j, x_k)$, which can be fully labelled, partially labelled or unlabelled. The inputs of the three TNNs are the three possible pairs drawn from the loop, so that the prediction of the full training model is a triple $(F(x_i, x_j), F(x_j, x_k), F(x_k, x_i))$. Depending on the loop type (fig. 2) of the input, labelled loops (loop A) are used in the mean squared error loss function (MSE objective) to train the TNN to predict the differences between the data labels $(y_i - y_j, y_j - y_k, y_k - y_i)$.
Unlabelled loops (loops B, C, D) are used to train the TNN to enforce loop consistency, eq. 3. For this purpose, the weights and biases of the TNNs are updated via gradient descent by minimizing the batchwise estimate of
$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}\left[ \left( F(x_i, x_j) - (y_i - y_j) \right)^2 + \left( F(x_j, x_k) - (y_j - y_k) \right)^2 + \left( F(x_k, x_i) - (y_k - y_i) \right)^2 \right] \quad (4)$$
if the loops are labelled and
$$\mathcal{L}_{\mathrm{loop}} = \mathbb{E}\left[ \left( F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) \right)^2 \right] \quad (5)$$
if the loops are unlabelled or partially labelled. The loop loss function can be seen as an MSE objective in which, for each prediction $F(x_i, x_j)$, a noisy label $-F(x_j, x_k) - F(x_k, x_i)$ is provided by the other two predictions. The combined loss function is
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \Lambda \, \mathcal{L}_{\mathrm{loop}} , \quad (6)$$
where the optimal loop weight $\Lambda$ is a hyperparameter that needs to be optimized by evaluating the loss on a validation set.
Our TNNs consist of two hidden layers with 192 neurons each. The hidden layers use rectified linear units as activation functions, while the output is a single linear neuron.

Let us discuss whether it is possible to relate TNNR to the existing classes of semi-supervised regression methods, which are co-training, kernel and graph based methods [12]. TNNR cannot be classified into any of these classes, although some ideas might appear similar. In a vague way, it is possible to argue that TNNR is similar to co-training, since in the loop loss function two predictions act as a noisy label for the third prediction. Kernel methods make use of some sort of similarity kernel function; TNNR provides a similarity measure in the label space through eq. 1. Further, TNNR can be connected to graph based methods, since the TNNR training loops are minimal subgraphs containing three nodes and three edges.
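The sketch below illustrates one way to realize this training architecture with the stated layer sizes: a single twin network (two hidden layers of 192 ReLU units, linear output) evaluated on the three pairs of a loop and trained on the combined loss of eq. 6. It is a hedged reconstruction in TensorFlow/Keras; the framework choice, batching details and optimizer settings are assumptions rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tnn(input_dim):
    """Twin network F(x_a, x_b): two hidden layers of 192 ReLU units, linear output."""
    xa = layers.Input(shape=(input_dim,))
    xb = layers.Input(shape=(input_dim,))
    h = layers.Concatenate()([xa, xb])
    h = layers.Dense(192, activation="relu")(h)
    h = layers.Dense(192, activation="relu")(h)
    out = layers.Dense(1)(h)                 # predicted difference y_a - y_b
    return Model([xa, xb], out)

def train_step(tnn, optimizer, labelled_loop, unlabelled_loop, loop_weight):
    """One gradient step on the combined loss L = L_MSE + loop_weight * L_loop (eq. 6).

    labelled_loop   = ((xi, xj, xk), (dij, djk, dki)) with loop labels of shape (batch, 1)
    unlabelled_loop = (ui, uj, uk) without labels
    """
    (xi, xj, xk), (dij, djk, dki) = labelled_loop
    ui, uj, uk = unlabelled_loop
    with tf.GradientTape() as tape:
        # supervised MSE objective on fully labelled loops (eq. 4)
        l_mse = (tf.reduce_mean(tf.square(tnn([xi, xj]) - dij))
                 + tf.reduce_mean(tf.square(tnn([xj, xk]) - djk))
                 + tf.reduce_mean(tf.square(tnn([xk, xi]) - dki)))
        # unsupervised loop-consistency objective (eq. 5):
        # predicted differences around a loop must sum to zero
        loop_sum = tnn([ui, uj]) + tnn([uj, uk]) + tnn([uk, ui])
        l_loop = tf.reduce_mean(tf.square(loop_sum))
        loss = l_mse + loop_weight * l_loop
    grads = tape.gradient(loss, tnn.trainable_variables)
    optimizer.apply_gradients(zip(grads, tnn.trainable_variables))
    return loss
```

Calling the same `tnn` on the three pairs is what realizes the three weight-sharing copies described in fig. 3.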

Data Preparation
We examine the performance of TNNR on different regression data sets: bio conservation (BC), Boston housing (BH), concrete strength (CS), energy efficiency (EE), RCL circuit (RCL), red wine quality (WN), test function (TF), Wheatstone bridge (WSB) and yacht hydrodynamics (YH). The common data sets can be found online at [16]. The scientific data sets are simulations of mathematical equations and physical systems: TF is a polynomial function combined with a sine function, RCL is a simulation of the electric current in an RCL circuit, and WSB is a simulation of the Wheatstone bridge voltage. More details can be found in the supplementary material.
We perform two different experiments to examine TNNR as a transductive and inductive semi-supervised learning algorithm. To focus solely on transductive learning, in the first experiment all data is split into 80% labelled training, 10% unlabelled validation and 10% unlabelled test data, all of which is used during the training phase. To examine inductive learning in addition to transductive learning, we perform a second experiment with 30% labelled training, 10% unlabelled validation and 50% unlabelled test data, all of which is used during training, plus 10% unlabelled data on which the performance of inductive semi-supervised learning is evaluated. We normalize and center the input features to a range of [−1, 1] based on the training data. As outlined in section 3.3, labelled and unlabelled data are combined to form loops, which are supplied to the training architecture (fig. 3) through a generator from which all possible loops are sampled. All resulting RMSEs in the figures and tables are produced by repeating each experiment 25 times with random splits of the underlying data; the same random splits are used while varying the hyperparameter Λ.
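The feature scaling step can be reproduced along the following lines; the min-max recipe below is one plausible way to normalize and center features to [−1, 1] based on the training data and is an assumption, since the paper does not spell out the exact transformation.

```python
import numpy as np

def scale_features(X_train, *X_other):
    """Map features to [-1, 1] using statistics of the training data only."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant features
    transform = lambda X: 2.0 * (X - x_min) / span - 1.0
    return [transform(X) for X in (X_train, *X_other)]
```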

Experiment 1: Transductive Semi-Supervised Learning
The first experiment focuses on transductive semi-supervised learning, where the goal is to predict labels of unlabelled data which is present during the training phase. First, for the example of the Boston housing data set, we include all possible loop types depicted in fig. 2 in the training data and examine the effect on the RMSEs of the training, validation and test sets. In fig. 4, we can see that the validation and test RMSE can be lowered by tuning the loop weight Λ to a certain finite value. However, at that point the training RMSE increases above its base value. This indicates that including loops containing unlabelled data points acts similarly to regularization and reduces overfitting to the labelled training set. Second, we determine the effects of including the different loop types in the training phase, again for the Boston housing data set. In fig. 4 we can see that all loops have a positive effect on the RMSE; however, some loops have a slightly stronger effect than others. Loops B and C tend to reduce the RMSE more than loop D. This is most likely because loops B and C contain the final predictions of differences between elements of the labelled training set and the unlabelled test set. The RMSE curves are visualized in the supplementary material. Due to noise, the optimal choice of Λ based on the validation RMSE sometimes differs slightly from the optimum for the test RMSE. In some cases (CS, WSB, YH) this prevents us from uniquely identifying an optimal finite Λ; in these cases we cannot claim an improvement of semi-supervised learning over purely supervised learning.
For some data sets the RMSE exhibits two local minima as a function of Λ. This effect is separate from noise and can be reproduced, but we do not know its cause.
On 6 out of 9 data sets the test RMSE can be significantly reduced by transductive semi-supervised learning. An interesting observation is that the data sets which did not benefit from internal TNNR ensembling in the original TNNR paper [18], namely bio conservation and red wine quality, are among those that benefit most from semi-supervised learning. This means that TNNR ensembling combined with semi-supervised training on loops significantly improves the RMSE on all considered data sets compared to previous state-of-the-art regression methods.

Experiment 2: Transductive and Inductive Semi-Supervised Learning
In the second experiment, the goal is to compare transductive and inductive semi-supervised learning using TNNR. The transductive RMSEs are evaluated on the unlabelled test data which was supplied to the training phase, while the inductive RMSEs are evaluated on the unlabelled test set which was not contained in the training loops. The validation set was again part of the training loops. In table 2 the improvements due to semi-supervised training are displayed, while the RMSE curves can be found in the supplementary material. From this table we conclude that TNNR can be successfully used as a transductive or inductive semi-supervised learning method.

Summary
We have presented a method to train twin neural network regression (TNNR) in a semi-supervised manner. While TNNR is already a state-of-the-art regression method, we observe a significant further improvement from semi-supervised training on loops containing differences between labelled and/or unlabelled data. On 6 out of 9 data sets the test RMSE can be significantly reduced by semi-supervised learning. Data sets which do not benefit from internal ensembling [18] are among those that benefit most from semi-supervised learning. Thus, TNNR ensembling combined with semi-supervised training on loops beats all considered supervised state-of-the-art methods on all considered data sets. This outperformance persists regardless of whether it is achieved via transductive or inductive semi-supervised learning. The effective cost of TNNR compared to traditional supervised neural networks is the quadratic scaling of the former with the number of data points versus the linear scaling of the latter.
In the future, it would be helpful to compare different semi-supervised regression methods to each other. It might be possible to improve the results by continuously increasing the loop weight Λ during training. It is very likely that certain specific loops improve the performance more than others; after identifying these loops, the learning process could be weighted in their favor. While in this work only the training process is assisted by loops, it might be fruitful to explore whether loops can also help during the inference phase.

Supplementary Material
The test function (TF) data set is created from a polynomial function combined with a sine function, with zero noise.
The output of the RCL circuit current data set (RCL) is the current through an RCL circuit, modeled by the equation
$$I_0 = \frac{V_0 \cos(\omega t)}{\sqrt{R^2 + \left( \omega L - 1/(\omega C) \right)^2}} ,$$
with added Gaussian noise of mean 0 and standard deviation 0.1.
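For illustration, the RCL data set can be regenerated roughly as follows. Only the functional form and the noise level (σ = 0.1) are taken from the text; the feature ranges and sample layout are made-up assumptions.

```python
import numpy as np

def make_rcl_dataset(n_samples, noise_std=0.1, seed=0):
    """Simulate the RCL-circuit current target from the stated formula plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    V0 = rng.uniform(1.0, 10.0, n_samples)    # driving voltage amplitude (assumed range)
    R = rng.uniform(0.5, 5.0, n_samples)      # resistance (assumed range)
    L = rng.uniform(0.1, 2.0, n_samples)      # inductance (assumed range)
    C = rng.uniform(0.1, 2.0, n_samples)      # capacitance (assumed range)
    w = rng.uniform(0.5, 5.0, n_samples)      # angular frequency omega (assumed range)
    t = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    current = V0 * np.cos(w * t) / np.sqrt(R ** 2 + (w * L - 1.0 / (w * C)) ** 2)
    y = current + rng.normal(0.0, noise_std, n_samples)
    X = np.column_stack([V0, R, L, C, w, t])
    return X, y
```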
The output of the Wheatstone bridge voltage data set (WSB) is the measured voltage given by the Wheatstone bridge equation, with added Gaussian noise of mean 0 and standard deviation 0.1.

Figure 5: Effect of transductive semi-supervised training on the RMSEs of different data sets. We train on 80% labelled training data, 10% unlabelled validation data and 10% unlabelled test data. The loop weight denotes the strength of the loop constraint versus the MSE loss function, see eq. 6. The blue area indicates the standard error.

Figure 6: Effect of transductive semi-supervised training on the RMSEs of different data sets. We train on 30% labelled training data, 10% unlabelled validation data and 50% unlabelled test data, whose targets are predicted using TNNR as a transductive semi-supervised learning method. The remaining 10% unlabelled test data is not used for training; its labels are predicted using TNNR as an inductive semi-supervised learning method. The loop weight denotes the strength of the loop constraint versus the MSE loss function, see eq. 6. The blue area indicates the standard error.

Figure 7: Effect of inductive semi-supervised training on the RMSEs of different data sets. We train on 30% labelled training data, 10% unlabelled validation data and 50% unlabelled test data, whose targets are predicted using TNNR as a transductive semi-supervised learning method. The remaining 10% unlabelled test data is not used for training; its labels are predicted using TNNR as an inductive semi-supervised learning method. The loop weight denotes the strength of the loop constraint versus the MSE loss function, see eq. 6. The blue area indicates the standard error.