Twin Neural Network Regression

We introduce twin neural network (TNN) regression. This method predicts differences between the target values of two different data points rather than the targets themselves. The solution of a traditional regression problem is then obtained by averaging over an ensemble of all predicted differences between the targets of an unseen data point and all training data points. Whereas ensembles are normally costly to produce, TNN regression intrinsically creates an ensemble of predictions of twice the size of the training set while only training a single neural network. Since ensembles have been shown to be more accurate than single models, this property naturally transfers to TNN regression. We show that TNNs are able to compete with, or yield more accurate predictions than, other state-of-the-art methods on a variety of data sets. Furthermore, TNN regression is constrained by self-consistency conditions. We find that the violation of these conditions provides an estimate for the prediction uncertainty.


Introduction
Regression aims to solve one of the two main classes of problems in supervised machine learning: it is the process of estimating the function that maps feature variables to an outcome variable. Regression can be applied to a wide range of problems. In everyday life, one may wish to predict the sales price of a house (Park & Bae, 2015) or the number of calories in a meal (Chokr & Elbassuoni, 2017). In business and industry, it could be desirable to estimate stock market changes (Patel et al., 2015) or the sales numbers of a certain product (Sun et al., 2008). Within the natural sciences, regression has been applied to a rich variety of problems; these include molecular formation energies (Rupp et al., 2012), electronic charge distributions (Ryczko et al., 2019), inter-atomic forces (Schütt et al., 2017), electronic band gaps (Chandrasekaran et al., 2019), and plasmonic response functions (Malkiel et al., 2018).

1 Perimeter Institute for Theoretical Physics, Canada; 2 University of Ottawa, Canada; 3 Vector Institute for Artificial Intelligence, Canada; 4 University of Waterloo, Canada; 5 National Research Council of Canada, Canada. Correspondence to: Sebastian J. Wetzel <swetzel@perimeterinstitute.ca>.
Indeed the wide range of the applications of regression makes it part of the standard curriculum across all of the quantitative domains (applied math, computer science, engineering, economics, the physical sciences, etc.). There are many existing algorithms which solve the regression problem, including linear regression, Gaussian process regression, random forest regression, xgboost, and artificial neural networks, among others.
Regression problems require accurate and reliable solutions. Hence, in addition to the prediction itself, it is desirable to estimate the associated error or uncertainty (Krzywinski & Altman, 2013; Ghahramani, 2015). Such uncertainty signals can be used to decide when it is safe to let a model make decisions in the absence of expert supervision. For example, when a self-driving car experiences unfamiliar road conditions, model uncertainty can be used as a signal that it must alter its behavior (SAE, 2018). This could mean taking a different path, slowing down, or in the extreme, stopping until a human driver can take over. Similarly, in medical diagnostics, automated classification and analysis of diagnostic imaging can improve reliability and reduce costs (McKinney et al., 2020). However, such tools can only be trusted if they have the ability to gauge their own accuracy, and will only make recommendations when the expected prediction accuracy is above a safe threshold. Recent successes with surrogate models (Ulissi et al., 2017; Kasim et al., 2020) require an accurate estimate of model uncertainty. Similarly, active learning algorithms rely on an agent's self-assessment ability; low confidence in a model can be used as a trigger for consulting the oracle (Zhang et al., 2019; Zhong et al., 2020). Methods to estimate the prediction uncertainty are often based on ensembles of predictions, where the variation in output across models is used as a proxy for the uncertainty. This includes real ensembles, obtained by combining the predictions of multiple models (Hansen & Salamon, 1990; Krogh & Vedelsby, 1994), and pseudo ensembles (Gal & Ghahramani, 2015; Bachman et al., 2014), obtained by perturbing certain parameters in the data or the model. While pseudo ensembles have the advantage of requiring no overhead in training time, real ensembles (Tran et al., 2020b) yield a much better prediction accuracy.
In this paper we present a regression algorithm based on twin neural networks (TNNs), historically known as Siamese neural networks, which leverage sample-to-sample comparisons when making predictions. Twin neural networks were introduced in (Baldi & Chauvin, 1993) for fingerprint identification and in (Bromley et al., 1994) for handwritten signature verification. They are popularly used in few-shot learning (Koch et al., 2015) and facial recognition applications (Taigman et al., 2014).
Our method naturally produces an ensemble of predictions by comparing a new unseen data point to all training data points. It combines the strengths of real and pseudo ensembles: on one hand, it creates a large number of predictions (twice the size of the training data set) at little additional training cost compared to a traditional neural network; on the other hand, as a real ensemble, it significantly increases the prediction accuracy.
TNN regression also provides estimates of model uncertainty by construction: its network topology gives rise to self-consistency conditions, and a violation of these conditions can be interpreted as a signal of decreased prediction accuracy. These checks include, among others, the standard deviation associated with the prediction.
We first describe the approach, demonstrate its performance on well known data sets, and finally examine its self-consistency conditions.
The main contribution of this paper is the development of a new regression algorithm which produces a large ensemble of predictions at low cost. This algorithm is shown to be competitive with, and on many data sets to outperform, the state of the art. Further, self-consistency conditions can be used to estimate the prediction error.

Prior Work
Twin (or Siamese) neural network (TNN) regression (fig. 1) yields high accuracy and uncertainty estimates based on the intrinsic ensemble it leverages.
TNNs were originally developed for the identification of fingerprints (Baldi & Chauvin, 1993) and handwritten signature verification (Bromley et al., 1994). More recently, coupled with deep convolutional architectures (LeCun et al., 1989), TNNs have been used for facial recognition (Taigman et al., 2014), few-shot learning (Koch et al., 2015), and object tracking (Bertinetto et al., 2016). The idea of pairwise similarity has also been shown to be an approach for unlabeled classification (Bao et al., 2018). TNNs have previously been used in the regressive task of extracting a camera pose from images (Doumanoglou et al., 2016). Other uses of TNNs with images and regression have focused on the medical domain, as an estimator for disease severity (Li et al., 2020). The ability of TNNs to extract similarities from data can also be used to determine conservation laws in physics (Wetzel et al., 2020).
Uncertainty assessments for linear regression are well established. Similarly, for stochastic processes there are standard techniques to estimate the uncertainty of a model. Gaussian processes (GP) (Bartók et al., 2010; Koistinen et al., 2016; Simm & Reiher, 2018; Proppe et al., 2019) naturally provide an estimate of uncertainty, but the cost of training grows quickly with the number of training samples; GP are impractical for large data sets, although there has been recent progress in this direction (Liu et al., 2019). GP are also unable to easily incorporate new data: it is necessary to fully retrain the model for each new data point or observation, making online learning very costly.
Conversely, there is not yet a single, established protocol for quantifying error for (deep) neural networks, particularly for the case of regression. An early and straightforward approach to uncertainty estimation is the use of ensembles.
Ensemble methods are commonly used in regression (and classification) tasks as a means of improving the prediction accuracy (Naftaly et al., 1997) and addressing the bias-variance tradeoff (Bishop, 2006) by combining the predictions of different models. Ensembles (Hansen & Salamon, 1990; Krogh & Vedelsby, 1994) can be produced in different ways, such as repeating the training while changing the training-validation split, or sampling intermediate models along the training trajectory (Swann & Allinson, 1998; Xie et al., 2013; Huang et al., 2017). Sampling along a training trajectory is efficient in that it generates approximately 5 ensemble members in the time it takes to train one traditional neural network.
The disagreement between models in an ensemble can be used as a signal of confidence among the set. Recent work has highlighted the problems associated with ensembles as a method for uncertainty estimation (Ashukha et al., 2020; Lakshminarayanan et al., 2017; Cortés-Ciriano & Bender, 2019). There is less theoretical justification for the reliability of errors from such approaches compared to GP. Intuitively it makes sense that a mismatch between models suggests the output cannot be trusted, but it is less clear that the magnitude of this mismatch can be assigned to a particular value of uncertainty.
An alternative method to a real ensemble is to sample the output of a network subject to perturbations, for example through the introduction of errors. This can be regarded as a form of pseudo ensemble (Bachman et al., 2014). MC dropout (Gal & Ghahramani, 2015) samples a network where nodes are randomly deactivated. In this case, error signals come from the fact that the collective response needed to overcome de-activations depends on the distribution of data itself. If data far away from the training regime is used, regression becomes unreliable (this is a well known effect). It also means that the learned corrections (e.g. the response of the remaining nodes) become less reliable; hence the variance observed when sampling the network output increases. MC dropout is successful because the topology is constructed implicitly to signal whether the output can be trusted.

Figure 1. Reformulation of a regression problem: In the traditional case a neural network is trained to map an input x to its target value f(x) = y. We reformulate the task to take two inputs x1 and x2 and train a twin neural network to predict the difference between the target values F(x2, x1) = y2 − y1. Hence, this difference can be employed as an estimator for y2 = F(x2, x1) + y1 given an anchor point (x1, y1).
Other methods for estimating model uncertainty and error include discriminant analysis (Morais et al., 2019), resampling (Musil et al., 2019), scoring rules (Gneiting & Raftery, 2007; Dawid & Musio, 2014), and domain specific metrics (Peterson et al., 2017; Liu et al., 2018; Tran et al., 2020a). Finally, it was recently shown that the projection of data into the latent space of a model can be used as a proxy for uncertainty: a close distance to one of the training points is correlated with lower error (Janet et al., 2019).

Reformulation of Regression
In a regression problem we are given a training set of n data points X^train = (x_1^train, ..., x_n^train) and target values Y^train = (y_1^train, ..., y_n^train). The task is to find a function f such that f(x_i) = y_i; further, we require that f generalizes to unseen data X^test with labels Y^test. In the following we reformulate this regression problem. Given a pair of data points (x_i^train, x_j^train), we train a neural network (fig. 1) to find a function F that predicts the difference between the targets,

F(x_i, x_j) = y_i − y_j.

This neural network can then be used as a solution of the original regression problem via y_i = F(x_i, x_j) + y_j. In this setting we call (x_j, y_j) the anchor for the prediction of y_i. This relation can be evaluated with every training data point x_j^train as anchor, such that the best estimate for the solution of the regression problem is obtained by averaging

y_i^pred = (1 / 2n) Σ_{j=1}^{n} [ (F(x_i, x_j^train) + y_j^train) + (y_j^train − F(x_j^train, x_i)) ].   (3)

The first advantage of the reformulation is that, for a single prediction of y_i, it creates an ensemble of predicted differences y_i − y_j of twice the size of the training set. While ensembles are in general costly to produce, TNN regression intrinsically yields a very large ensemble at little additional training cost.
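As a concrete illustration, the averaging in eq. 3 can be sketched in a few lines of Python; `diff_model` is a hypothetical stand-in for the trained twin network F, not the paper's code.

```python
# Sketch of TNN inference by ensemble averaging over all anchors (eq. 3).
# `diff_model` is any callable approximating F(x_i, x_j) ~ y_i - y_j.

def tnn_predict(diff_model, x_new, X_train, Y_train):
    """Average the 2n difference-based estimates of the target for x_new."""
    estimates = []
    for x_j, y_j in zip(X_train, Y_train):
        estimates.append(diff_model(x_new, x_j) + y_j)  # anchor used forward
        estimates.append(y_j - diff_model(x_j, x_new))  # anchor used reversed
    return sum(estimates) / len(estimates)
```

For an exact difference predictor every ensemble member agrees; for a trained network, the spread of the individual estimates is exactly the quantity the self-consistency analysis exploits.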
In general, we do not expect the ensemble diversity of TNN regression to be similar to that of traditional ensembles, because the prediction of a regression target y_i is based on different predictions of the differences y_i − y_j from multiple anchor points. This allows us to combine TNN regression with any traditional ensemble method to achieve an even more accurate prediction.
Each prediction of y_i is formed from a finite range of differences y_i − y_j, and the anchor data points differ by more than an infinitesimal perturbation. Hence, the TNN regression ensemble is not just a pseudo ensemble obtained by small perturbations of the model weights.
The intrinsic ensembles of TNNs are not conventional ensembles. Like k-nearest-neighbor regression or support vector regression, the prediction is formed by comparing a new unseen data point to several support vectors or nearest neighbors belonging to the training set. However, in contrast to these algorithms, TNN regression can be seen from the perspective of a single neural network with weight diversity: in the prediction y_i = F(x_i, x_j) + y_j, we can consider x_i as the input to a traditional model predicting y_i, while x_j can be understood as auxiliary parameters which influence the weights. The offset y_j can be seen as changing the bias of the output layer.
In principle, an estimation of a regression target through eq. 3 might be prone to a larger error than a traditional estimation, since the TNN makes predictions from two data points and thus accumulates the errors of both inputs. However, since we average over the whole training set, the errors on the anchor data points are uncorrelated and average out, leading to a suppression by a factor of 1/√(2n), where n is the training set size.
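A quick numerical illustration of this averaging effect (ours, not from the paper's code; the Gaussian noise model for the anchor errors is an assumption):

```python
# Averaging 2n uncorrelated anchor errors shrinks the aggregate error
# roughly as 1/sqrt(2n): quadrupling n should halve the standard deviation.
import random
import statistics

def std_of_averaged_errors(n, noise=1.0, trials=2000, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        errs = [rng.gauss(0.0, noise) for _ in range(2 * n)]  # 2n anchor errors
        means.append(sum(errs) / (2 * n))
    return statistics.pstdev(means)
```

With n = 100 versus n = 25, the ratio of the two standard deviations comes out close to 1/2, as the 1/√(2n) suppression predicts.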

Self-Consistency Conditions
The twin neural network predicts the difference between two regression targets. An accurate prediction requires the satisfaction of many self-consistency conditions, see fig. 1: any closed loop of difference predictions F(x_i, x_j) = y_i − y_j sums to zero. Any violation of such a condition is an indication of an inaccurate prediction. In principle, there are several ways to harness these self-consistency conditions for practical use. First, they can be used to estimate the magnitude of the prediction error. Second, they could be utilized to force the neural network to satisfy the conditions during training. Finally, they enable the use of predictions on previous test data as anchor points for new predictions.
The smallest loop contains only two data points x_i, x_j, for which an accurate TNN needs to satisfy

F(x_i, x_j) + F(x_j, x_i) = 0.   (4)

During training, each batch includes both the pair (x_i, x_j) and its mirror copy (x_j, x_i) to enforce the satisfaction of this condition. The predictions on any three data points x_i, x_j, x_k should satisfy

F(x_i, x_j) + F(x_j, x_k) + F(x_k, x_i) = 0.   (6)

For x_i, x_j ∈ X^train the target values y_i, y_j are known. Thus, this condition becomes equivalent to the statement that the prediction of y_k must be the same for any two different anchor points (x_i, y_i) and (x_j, y_j).
This condition is trivially enforced during training. We examine the relation between the magnitude of the violations of these conditions and the prediction error in section 4.2. To this end, we employ the ensemble of predictions, calculate the standard deviation corresponding to eq. 4 and eq. 6, and find a distinct correlation with the out-of-domain prediction error.
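The loop conditions can be checked mechanically. The following sketch (ours; `F` is a hypothetical stand-in for the trained difference predictor) collects the two- and three-point residuals whose spread serves as the uncertainty proxy:

```python
# Residuals of the self-consistency loops: both lists are identically zero
# for an exact difference predictor F(x_i, x_j) = y_i - y_j.

def loop_violations(F, xs):
    n = len(xs)
    pair = [F(xs[i], xs[j]) + F(xs[j], xs[i])                   # eq. 4 residuals
            for i in range(n) for j in range(n) if i != j]
    tri = [F(xs[i], xs[j]) + F(xs[j], xs[k]) + F(xs[k], xs[i])  # eq. 6 residuals
           for i in range(n) for j in range(n) for k in range(n)
           if len({i, j, k}) == 3]
    return pair, tri
```

The standard deviation of these residual lists is the quantity correlated with the prediction error in the experiments below.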

Twin Neural Network Architecture
The reformulation of the regression problem does not require a solution by artificial neural networks. However, neural networks scale favorably with the number of input features, and we employ the same neural network for all data sets. Our TNN takes a pair of inputs (x_i, x_j), which is fed into a fully connected neural network with two hidden layers and a single output neuron. Each hidden layer consists of 64 neurons with a ReLU activation function. On data sets containing hierarchical structures, such as image data sets or audio recordings, it is helpful to include shared layers that act on only one element of the input pair; this is commonly used in few-shot image recognition (Bromley et al., 1994). We optimized a common architecture that works well for all data sets considered in this work. We examined different regularization methods like dropout and L2 regularization and found that in some cases a small L2 penalty improves the results; more details can be found in the supplementary materials. Since the improvement was not statistically significant or uniform among different splits of the data, our main results omit any regularization. The training objective is to minimize the mean squared error on the training set. For this purpose we employ the standard gradient-based optimizers Adadelta (and RMSprop) to minimize the loss on a batch of 16 pairs at each iteration. We stop the training when the validation loss stops decreasing.
The single feed-forward neural network (ANN) that we employ for comparisons has an architecture similar to the TNN: it has the same hidden layers, and we examined the same hyperparameters. The convolutional neural networks are built on this architecture, with the first two dense layers replaced by convolutional layers with 64 filters of size 3. The neural networks are robust with respect to changes of the specific architecture, in the sense that the results do not change significantly. The neural networks were implemented using keras (Chollet et al., 2015) and tensorflow (Abadi et al., 2015).
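To make the architecture concrete, here is a minimal pure-Python sketch of the TNN forward pass (illustrative random weights, not the keras implementation used in the experiments): the concatenated pair passes through two 64-unit ReLU layers to a single linear output.

```python
import random

def relu(v):
    return [max(0.0, u) for u in v]

def dense(v, W, b):
    # one fully connected layer: W is (n_out x n_in), b has length n_out
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def make_layer(n_in, n_out, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def tnn_forward(x_i, x_j, layers):
    v = list(x_i) + list(x_j)            # the pair is concatenated at the input
    (W1, b1), (W2, b2), (W3, b3) = layers
    v = relu(dense(v, W1, b1))           # hidden layer 1: 64 units, ReLU
    v = relu(dense(v, W2, b2))           # hidden layer 2: 64 units, ReLU
    return dense(v, W3, b3)[0]           # single output: predicted y_i - y_j

rng = random.Random(0)
n_features = 5                           # illustrative feature count
layers = [make_layer(2 * n_features, 64, rng),
          make_layer(64, 64, rng),
          make_layer(64, 1, rng)]
```

An untrained network of course violates the antisymmetry condition; training on mirrored pairs, as described above, pushes F(x_i, x_j) towards −F(x_j, x_i).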

Data Preparation
We examine the performance of TNN regression on different data sets: Boston housing (BH), concrete strength (CS), energy efficiency (EE), yacht hydrodynamics (YH), red wine quality (WN), bio conservation (BC), random polynomial (RP), RCL circuit (RCL), Wheatstone bridge (WSB) and the Ising model (ISING). The common data sets can be found online at (UCI, 2020). The science data sets are simulations of mathematical equations and physical systems: RP is a polynomial of degree two in five input features with random coefficients; RCL is a simulation of the electric current in an RCL circuit; WSB is a simulation of the Wheatstone bridge voltage. ISING consists of spin configurations on a lattice of size 20 × 20 and the corresponding energies, and is used to demonstrate an image regression problem. More details can be found in the supplementary material.
All data is split into 90% training, 5% validation and 5% test data. Each run is performed on a different, randomly chosen split of the data. We normalize and center the input features to a range of [−1, 1] based on the training data. In the context of uncertainty estimation, we further divide all data based on their regression targets y: we exclude a certain target range from the training data so that we can examine the performance of the neural networks outside of the training domain.
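A sketch of this preprocessing step (feature-wise min-max scaling fit on the training split only; the function names are ours):

```python
# Fit a [-1, 1] scaler on the training features and apply it to any split.
def fit_scaler(X_train):
    cols = list(zip(*X_train))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]

    def scale(X):
        return [[2.0 * (v - l) / (h - l) - 1.0 if h > l else 0.0
                 for v, l, h in zip(row, lo, hi)]
                for row in X]

    return scale
```

Test and validation data are passed through the same `scale`, so out-of-domain points may legitimately fall outside [−1, 1].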
While the data can be fed into a standard ANN in a straightforward manner, one must be careful in the preparation of the TNN data. Starting with a training data set of n data points, we can create n^2 different training pairs for the TNN. Hence, the TNN has the advantage of having access to a much larger training set than the ANN. However, for a large number of training examples it does not make sense to store all pairs, due to memory constraints. That is why we train on a generator which produces all possible pairs batch-wise before reshuffling.
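A sketch of such a generator (ours, not the paper's code): only index pairs are shuffled and held per epoch, never the paired feature vectors themselves.

```python
import random

def pair_batches(n, batch_size=16, rng=None):
    """Yield batches of (i, j) index pairs covering all n*n ordered pairs per epoch."""
    rng = rng or random.Random(0)
    while True:
        order = [(i, j) for i in range(n) for j in range(n)]
        rng.shuffle(order)  # reshuffle at the start of every epoch
        for k in range(0, len(order), batch_size):
            yield order[k:k + batch_size]
```

Each batch of indices is then materialized on the fly into feature pairs (x_i, x_j) and difference targets y_i − y_j.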

Prediction Accuracy
We train random forests (RF), xgboost (XGB), traditional single neural networks (ANNs), TNNs and ensembles to solve the regression tasks outlined in section 3.4. The hyperparameters of RF and XGB are optimized for each data set via cross-validation. The performance on these data sets is measured by the root mean square error (RMSE), which is shown in table 1. Each result is averaged over 20 different random splits of the data; in this manner we find the best estimate of the RMSE and the corresponding standard error. We also produce explicit ensembles (E) by training the corresponding ANN and TNN 20 times. This means each RMSE in the table requires training 20 times, or 400 times for the ensembles (E).

Figure 3. Over this interval the model has low uncertainty, measured by the standard deviation of the TNN ensemble prediction. This is equivalent to the satisfaction of the self-consistency condition eq. 6. Conversely, within the interval which was obscured during training, the performance of the model is poor. The corresponding higher uncertainty, or violation of the self-consistency conditions, is a signal that the model is performing poorly.
In table 1 we see that TNNs outperform single ANNs and even ensembles of ANNs on all data sets except BC; we assume this is a statistical outlier. Creating an ensemble of TNNs increases the performance even further. The significance of the difference in performance is clearly supported by the comparably small standard error. On discrete data, especially YH, WN and BC, XGB slightly outperforms ANNs at a much shorter training time; however, TNNs are able to compete. On science data sets, neural-network-based methods beat tree-based methods significantly due to the continuous variables. On image data, convolutional neural networks outperform RF and XGB even though ISING is discrete. The general trend is that TNNs outperform ANNs; however, the TNN takes 15 to 40 times longer to converge than a single ANN. In the case of Boston Housing this translates to approximately 1000 ensemble members in 40× the time, or 25 ensemble members per time equivalent.
Convolutional TNN architectures, as employed for the Ising model energy predictions, are nontrivial; the present result serves as a proof of concept and is further explained and extended in (Authors, 2020).
The outperformance of TNNR over other regression algorithms is based on exploiting the pairing of data points to create a huge training set and a large ensemble at inference time. If the training set is very large, the number of pairs increases quadratically to a point where the TNN will in practice converge to a minimum before observing all possible pairs. At that point the TNN begins to lose its advantages in terms of prediction accuracy. A visualization of this effect can be seen in fig. 2, where different ANN and TNN architectures are applied to the RP data set. One may observe the performance improvement, in terms of lower RMSE, of all neural networks when increasing the number of training data points. The TNN achieves a lower RMSE than the ANN on small and medium sized data sets. The TNN reaches a plateau sooner than the ANN, such that both algorithms perform similarly well in the regime between 60000 and 100000 data points. We infer that if the training set is sufficiently large, an ANN starts to compete with or outperform the TNN.

Prediction Uncertainty Estimation
Equipped with a regression method that yields a huge ensemble of predictions constrained by self-consistency conditions, we examine reasonable proxies for the prediction error.
Here we must distinguish between in-domain and out-of-domain prediction uncertainty. In-domain refers to data from the same manifold as the training data. If the test manifold differs from the training manifold, one is concerned with out-of-domain data; this is the case in interpolation or extrapolation, or if certain features in the test data exhibit a different correlation than in the training data.
Consider, for example, the polynomial function of one variable (fig. 3) as an example of an out-of-domain interpolation problem. The TNN is very accurate on the test set as long as one stays on the training manifold. As soon as the test data leaves the training manifold, the prediction becomes significantly worse: while the prediction resembles a line connecting the endpoints of the predictions on the training manifold, it fails to capture the true function. Since the difference between the true function and the regression result is consistently larger than the standard deviation, we conclude that the standard deviation of the TNN regression result cannot be used as a direct estimate of the prediction uncertainty.
In the following we examine the possibility of a meaningful estimation of the prediction uncertainty on the BH, CS, EE and RP data sets. Before training, we separate 25% of the data as an out-of-domain test set test_out, dependent on a threshold on the regression target y, to simulate an extrapolation problem. We use 50% as training data, 15% as in-domain test data test_in and 10% as validation data. We examine whether it is possible to estimate the prediction uncertainty using established methods. We perform Monte-Carlo dropout on an ANN (Gal & Ghahramani, 2015); this method applies dropout during the prediction phase to estimate the uncertainty of the prediction. In the case of TNNs, we examine the standard deviations of the violations of the self-consistency conditions eq. 4 and eq. 6, which include the standard deviation of each single prediction. Finally, we compare the latent space distance (Tran et al., 2020a) between each prediction and the training data to its prediction uncertainty; here the projection into latent space is the output of the second-to-last layer of the neural network.
The results of these examinations can be found in fig. 4. We differentiate between predictions on three different data sets: the training set, the in-domain test set test_in and the out-of-domain test set test_out. A general observation is that the prediction error of each single sample tends to be smallest on the training set, higher on test_in and even higher on test_out (table 2). We can also see that on the training set none of the methods for prediction uncertainty estimation is accurate. However, we find a strong correlation between some of the methods and the prediction error on the test sets test_in and test_out. There is no clear evidence that any of the prediction uncertainty estimation methods is uniformly better than any other; however, dropout seems to be worse than the other methods.
In our data sets, on the test sets there is empirical evidence that the prediction uncertainty can be modeled with functions of the form a·σ^α and b·d^β, where σ is the standard deviation, d the latent space distance, and a, α, b, β data set dependent parameters.
Further, we make a practical observation. As long as any of the estimators, applied to an unseen data point, is of the same magnitude as on the in-domain test set test_in, we expect the prediction uncertainty to be best estimated by RMSE_test_in. If any of the estimators is larger than that (as is often the case on test_out), we expect the prediction error to be larger than the test error. This observation explains why the TNN self-consistency check can be employed as a proxy for decreasing prediction accuracy.
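This rule of thumb can be stated in one line; the factor `tolerance` is our assumption, not a value from the paper.

```python
# Flag a prediction as untrustworthy when its uncertainty proxy (e.g. the
# ensemble standard deviation, the eq. 4/6 violations, or the latent distance)
# exceeds the level observed on the in-domain test set.
def flag_prediction(proxy_new, proxy_test_in, tolerance=1.5):
    return proxy_new > tolerance * proxy_test_in
```

Flagged predictions would then be routed to an expert or to the oracle of an active learning loop, as motivated in the introduction.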

Conclusions
We have introduced twin neural network regression (TNN regression). It is based on reformulating a traditional regression problem into the prediction of the difference between two regression targets, after which an ensemble is created by averaging the differences anchored at all training data points. This bears two advantages: a) the number of training pairs grows quadratically with the original data set size; b) by anchoring each prediction at all training data points, one obtains an ensemble of predictions of twice the training set size. Although a straightforward comparison is difficult, compared to trajectory ensemble methods (Huang et al., 2017), which typically produce approximately ×5 snapshots per real training time equivalent, TNN regression produces at least ×25 ensemble members on our data sets. We have demonstrated that TNN regression can compete with and outperform traditional regression methods including ANN regression, random forests and xgboost.
Building an ensemble of TNNs can improve the predictive accuracy even further (table 1). TNNs are competitive with tree-based methods on discrete data sets; however, xgboost is significantly faster to train. On continuous data sets and on image-based data sets TNNs are the clear winner. In the case of a large number of training data points, TNNR might not see all possible pairs before convergence, such that it cannot leverage its full advantage over traditional ANNs; in this case ANNs are able to compete with TNNR, as shown in fig. 2. Since the number of anchor points at inference time is only linear in the number of training data points, sampling is normally not necessary. A successfully trained TNN must satisfy many self-consistency conditions, and the violation of these conditions gives rise to an uncertainty estimate of the final prediction (fig. 4). Future directions include intelligently weighting the ensemble members through improved ensembling techniques. Some problems might benefit from exchanging the ensemble mean for the median. It would also be interesting to examine the ensemble diversity of TNN regression compared to other ensembles.

Datasets
We describe the properties of the data sets in table 3. In addition to the common reference data sets, we included scientific data and image-like data for regression. The random polynomial function (RP) is a data set created from a polynomial of degree two in five input features, with random but fixed coefficients and zero noise.
The output of the RCL circuit current data set (RCL) is the current through an RCL circuit, modeled by the corresponding circuit equation, with added Gaussian noise of mean 0 and standard deviation 0.1.
The output of the Wheatstone bridge voltage data set (WSB) is the measured voltage, given by the corresponding bridge equation, with added Gaussian noise of mean 0 and standard deviation 0.1.

Hyperparameter Optimization
In this section we include additional results of experiments on the data sets Boston housing (BH), concrete strength (CS), energy efficiency (EE) and random polynomial function (RP). We varied the L2 regularization and replaced our main optimizer Adadelta with RMSprop. Finally, we examined a checkpoint that saves the weights based on the best performance on the validation set. We find that Adadelta seems to give more consistent results. As long as the L2 penalty is not too large, there is no clear evidence to favour a certain L2 penalty over another. We further examined modifications to the architecture, dropout regularization and different learning rate schedules in preliminary experiments; these are not listed here, since none of the changes led to significant and uniform improvement.