Detecting adversarial manipulation using inductive Venn-ABERS predictors

Inductive Venn-ABERS predictors (IVAPs) are a type of probabilistic predictor with the theoretical guarantee that their predictions are perfectly calibrated. In this paper, we propose to exploit this calibration property for the detection of adversarial examples in binary classification tasks. By rejecting predictions when the uncertainty of the IVAP is too high, we obtain an algorithm that is both accurate on the original test set and resistant to adversarial examples. This robustness is observed on adversarials for the underlying model as well as adversarials that were generated by taking the IVAP into account. The method appears to offer competitive robustness compared to the state of the art in adversarial defense while being computationally much more tractable.


Introduction
Modern machine learning methods are able to achieve exceptional empirical performance in many classification tasks [1,2]. However, they usually give only point predictions: a typical machine learning algorithm for classification of, say, images from the ImageNet data set [3] will output only a single label given any image, without reporting how accurate this prediction is. In complex domains such as medicine, it is highly desirable to have not just a prediction but a measure of confidence in this prediction as well. Many state-of-the-art machine learning methods provide some semblance of such a measure by transforming their inputs into vectors of probabilities over the different possible outputs and then selecting as the point prediction the output with the highest probability. Usually, these probabilities are obtained by first training a scoring classifier which assigns to each input a vector of scores, one score per class. These scores are then turned into probabilities via some method of calibration. The most popular calibration method is Platt's scaling [4], which fits a logistic sigmoid to the scores of the classifier.
Contrary to popular belief, however, the probability vectors that result from methods such as Platt's scaling cannot be reliably interpreted as measures of confidence [5]. The phenomenon of adversarial manipulation illustrates this problem: for virtually all current machine learning classifiers, it is possible to construct adversarial examples [6]. These are inputs which clearly belong to a certain class according to human observers and which are highly similar to inputs on which the model performs very well. Despite this, the model misclassifies the adversarial example and assigns a very high probability to the resulting (incorrect) label. In fact, the inputs need not even be similar to any natural input: machine learning models will also sometimes classify unrecognizable inputs as belonging to certain classes with very high confidence [7]. Fig. 1 shows examples both of imperceptible adversarial perturbations and of unrecognizable images which are classified as familiar objects with high confidence. These observations clearly undermine the idea that one can use such calibrated scores as measures of model confidence. This raises the question: do there exist machine learning algorithms which yield provably valid measures of confidence in their predictions? If so, could these models be used to detect adversarial manipulation of input data?
The answer to the first question is known to be affirmative: the algorithms in question are called conformal predictors [8][9][10] . Our contribution here is to show that such valid confidence measures can in fact be used to detect adversarial examples. In particular, there exist methods which can turn any scoring classifier into a probabilistic predictor with validity guarantees; this construction is known as an inductive Venn-ABERS predictor or IVAP [11] . By making use of the confidence estimates output by the IVAPs, many state-of-the-art machine learning models can be made robust to adversarial manipulation.

Related work
The reliability of machine learning techniques in adversarial settings has been the subject of much research for a number of years already [12][13][14][15]. Early work in this field studied how a linear classifier for spam could be tricked by carefully crafted changes in the contents of spam e-mails, without significantly altering the readability of the messages. More recently, Szegedy et al. [14] showed that deep neural networks also suffer from this problem. Since this work, research interest in the phenomenon of adversarial examples has increased substantially and many attacks and defenses have been proposed [6,16,17]. The defenses can be broadly categorized as follows:
• Detector methods [18][19][20][21]. These defenses construct a detector which augments the underlying classification model with the capability to detect whether an input is adversarial or not. If the detector signals a possible adversarial, the model prediction is considered unreliable on that instance and the classification result is flagged.
• Denoising methods [22][23][24][25]. Here, the goal is to restore the adversarial examples to their original, uncorrupted versions and then perform classification on these cleaned samples.
• Other methods [6,22,26][27][28]. Another class of defenses performs neither explicit detection nor filtering. Rather, the aim is to make the model inherently more robust to manipulation via data augmentation, regularization, modified optimization algorithms or special architectures.
Despite the large number of defenses that have been proposed so far, at the time of this writing only one technique is generally accepted as having any noticeable effect [29]: adversarial training [27,28]. However, even this method currently has only limited success. The Madry defense [27], for instance, achieves less than 50% adversarial accuracy on the CIFAR-10 data set [30] even though state-of-the-art clean accuracy is over 95% [31]. Moreover, many recently proposed defenses suffer from a phenomenon called gradient masking [32]. Here, the defense protects the model against adversarials by obfuscating its gradient information. This is commonly done by introducing non-differentiable components such as randomization or JPEG compression. However, such tactics only render the model robust against gradient-based attacks, which crucially rely on this information to succeed. Attacks that do not need gradients or that can cope with imprecise approximations (such as BPDA [32] or black-box attacks [33,34]) will not be deterred by gradient masking defenses.
We are not the first to consider the hypothesis that adversarial examples are due to faulty confidence measures. Most existing research in this area has focused on utilizing the uncertainty estimates provided by Bayesian deep learning [35] , especially Monte Carlo dropout [36][37][38] and other stochastic regularization techniques [5] . On the other hand, our approach is decidedly frequentist and hence differs significantly from this related work. We are not aware of much other work in this area that is also frequentist (e.g. Grosse et al. [20] ), as the field of deep learning (and model uncertainty in particular) appears to be primarily Bayesian. Although significant progress is being made [39] , scaling the Bayesian methods to state-of-the-art deep neural networks remains an open problem. The Bayesian approach of integrating over weighted likelihoods also suffers from certain pathologies which may diminish its usefulness in the adversarial setting [40] . It is our hope that the method detailed here, which draws upon the powerful framework of conformal prediction [9] , can serve as a scalable and effective alternative to Bayesian deep learning.
The defense we propose in this work falls under the category of detector methods. We are well-aware of the bad track record that these methods have: for example, Carlini and Wagner [41] evaluate many recently proposed adversarial detector methods and find that they can be easily bypassed. However, by following the advice of Carlini and Wagner [41] and Athalye et al. [32] in the evaluation of our detector, we hope to show convincingly that our method has the potential to be a strong, scalable and relatively simple defense against adversarial manipulation. In particular, we test our detector on adversarial examples generated using existing attacks that take the detector into account as well as a novel white-box attack that is specifically tailored to our defense. We conclude that the defense we propose in this work remains robust even when the attacks are adapted to the defense and it appears to be competitive with the defense proposed by Madry et al. [27] .
The present work is an extension of our ESANN 2019 submission [42]. The main additions are as follows:
• Include the Zeroes vs ones data set.
• Perform an ablation study to test the sensitivity of the IVAP to its hyperparameters.
• Evaluate the defense in the ℓ_2 norm as well instead of only the ℓ_∞ norm.
• Significantly extend the discussion of the results.
• Release a reference implementation on GitHub.
• Include qualitative examples of adversarials produced by our white-box attack.

Organization
The rest of the paper is organized as follows. Section 2 provides the necessary background about supervised learning, conformal prediction and IVAPs which will be used in the sequel. Section 3 describes the defense mechanism which we propose for the detection of adversarial inputs. This defense is experimentally evaluated on several tasks in Section 4. Experimental comparisons to the Madry defense are carried out in Section 5. Section 6 contains the conclusion and avenues for future work. The code for our reference implementation can be found at https://github.com/saeyslab/binary-ivap

Background
We consider the typical supervised learning setup for classification. There is a measurable object space X and a measurable label space Y. We let Z = X × Y. There is an unknown probability measure P on Z which we aim to estimate. In particular, we have a class H of X → Y functions and a data set S = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ Z. Our goal is to find a function in H which fits the data best:

f̂ = argmin_{f ∈ H} (1/n) ∑_{i=1}^n ℓ(x_i, y_i, f),

where ℓ is the loss function. In principle, this can be any nonnegative X × Y × H → R function. Its purpose is to measure how "close" the function f is to the ground truth y_i locally at the point x_i. Our objective is to find a minimizer f in H for the average of the loss over the data set S. For k-ary classification, a commonly used loss function is the log loss

ℓ(x, y, f) = −log p_y(x; f).

Here, p_y(x; f) is the probability that f assigns to class y for the input x. Ideally, for a discriminative model this should be equal to the conditional probability Pr[y | x] under the measure P. In the case of a scoring classifier, these probabilities are computed by fitting a scoring function g to the data followed by a calibration procedure s. The final classification is then computed as

f(x) = argmax_i s_i(g(x)).

In the case of Platt's scaling [4], which is in fact a logistic regression on the classifier scores, s_i is the softmax function

s_i(z) = exp(w_i z_i + b_i) / ∑_j exp(w_j z_j + b_j).   (1)

Here, w_i and b_i are learned parameters.
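As an illustration, Platt-style calibration for a binary scoring classifier can be sketched in a few lines. This is our own simplified version: plain gradient descent on the log loss of a single sigmoid, without the regularized target labels of Platt's original procedure, and with made-up data.

```python
import math

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit a sigmoid p(y=1|s) = 1/(1+exp(-(w*s+b))) to binary labels by
    gradient descent on the log loss. A simplified stand-in for Platt's
    procedure, which additionally regularizes the target labels."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-(w * s + b))) for s in scores]
        grad_w = sum((p - y) * s for p, y, s in zip(probs, labels, scores)) / n
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy calibration set: higher scores tend to belong to class 1.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = platt_scale(scores, labels)
prob = 1.0 / (1.0 + math.exp(-(w * 2.0 + b)))  # calibrated probability for score 2.0
```

On this separable toy data the fitted slope w is positive, so high scores map to calibrated probabilities above one half.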

Conformal prediction
It is known [5,36] that the calibrated scores produced by scoring classifiers such as (1) cannot actually be used as reliable estimates of the conditional probability Pr[y | x]: it is possible to generate inputs for any given classifier such that the model consistently assigns high probability to the wrong classes. It makes sense, then, to look for a type of machine learning algorithm which has provable guarantees on the confidence of its predictions. This leads us naturally to the study of conformal predictors [8], which hold exactly this promise.
In general, a conformal predictor is a function Γ which, when given a confidence level ε ∈ [0, 1], a bag of instances B = ⦅z_1, . . . , z_n⦆ and a new object x ∈ X, outputs a set Γ^ε(B, x) ⊆ Y. Intuitively, this set contains those labels which the predictor believes with at least 1 − ε confidence could be the true label of the input sample based on the bag of examples given to it. These predictors must satisfy two properties [8,10]:
1. Nested prediction regions. If ε_1 ≤ ε_2, then Γ^{ε_2}(B, x) ⊆ Γ^{ε_1}(B, x).
2. Exact validity. If the true label of x is y, then y ∈ Γ^ε(B, x) with probability at least 1 − ε.
The exact validity property is too much to ask from a deterministic predictor [8]. Instead, the algorithms we consider here are only conservatively valid. Specifically, let ω = z_1, z_2, . . . be an infinite sequence of samples from an exchangeable distribution P. A distribution is said to be exchangeable if, for every sequence z_1, . . . , z_n and every permutation π of {1, . . . , n},

P(z_1, . . . , z_n) = P(z_{π(1)}, . . . , z_{π(n)}).

That is, each permutation of a sequence of samples from P is equally likely. In particular, i.i.d. samples are always exchangeable.
Define the error after seeing n samples at significance level ε as

err_n^ε(Γ, ω) = 1 if y_n ∉ Γ^ε(⦅z_1, . . . , z_{n−1}⦆, x_n), and 0 otherwise.

A conformal predictor Γ is conservatively valid if, at each significance level ε, these errors occur with probability at most ε. The conformal prediction algorithm (Algorithm 1) takes as a parameter a non-conformity measure A. In general, a non-conformity measure is any measurable real-valued function which takes a bag of samples z_1, . . . , z_n along with an additional sample z and maps them to a non-conformity score A(⦅z_1, . . . , z_n⦆, z). This score is intended to measure how much the sample z differs from the given bag of samples. The conformal prediction algorithm is always conservatively valid regardless of the choice of non-conformity measure. However, the predictive efficiency of the algorithm — that is, the size of the prediction region Γ^ε(B, x) — can vary considerably with different choices for A. If the non-conformity measure is chosen sufficiently poorly, the prediction regions may even be equal to the entirety of Y. Although this is clearly valid, it is useless from a practical point of view. Algorithm 1 determines a prediction region for a new input x ∈ X based on a bag of old samples by iterating over every label y ∈ Y and computing an associated p-value p_y. This value is the empirical fraction of samples in the bag (including the new "virtual sample" (x, y)) with a non-conformity score that is at least as large as the non-conformity score of (x, y). By thresholding these p-values we obtain a set of candidate labels y_1, . . . , y_t such that each possible combination (x, y_1), . . . , (x, y_t) is "sufficiently conformal" to the old samples at the given level of confidence.
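The p-value computation just described can be sketched as follows. The non-conformity measure used here (distance of an example to the mean of the examples sharing its label) is an arbitrary illustrative choice of ours, not one from the paper, and the data are made up.

```python
def conformal_region(bag_x, bag_y, x_new, eps, nc):
    """Conformal prediction sketch: for each candidate label y, extend the
    bag with the "virtual sample" (x_new, y), compute its p-value, and keep
    the labels whose p-value exceeds the significance level eps."""
    region = []
    for y in sorted(set(bag_y)):
        xs, ys = bag_x + [x_new], bag_y + [y]
        scores = [nc(xs, ys, i) for i in range(len(xs))]
        # Fraction of samples at least as non-conforming as the virtual one.
        p = sum(s >= scores[-1] for s in scores) / len(scores)
        if p > eps:
            region.append(y)
    return region

# Illustrative non-conformity measure (our own choice): distance of
# example i to the mean of the examples carrying the same label.
def nc(xs, ys, i):
    same = [x for x, y in zip(xs, ys) if y == ys[i]]
    return abs(xs[i] - sum(same) / len(same))

bag_x = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
bag_y = [0, 0, 0, 1, 1, 1]
region = conformal_region(bag_x, bag_y, x_new=0.12, eps=0.2, nc=nc)  # -> [0]
```

For the new object 0.12, the virtual sample labeled 0 conforms well to the bag while the one labeled 1 does not, so the prediction region contains only label 0 at this significance level.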

Inductive Venn-ABERS predictors
Of particular interest to us here will be the inductive Venn-ABERS predictors or IVAPs [11]. These are related to conformal predictors but they take advantage of the predictive efficiency of some other inductive learning rule (such as a neural network or support vector machine). The IVAP algorithm for the case of binary classification is shown in Algorithm 2.
The output of the IVAP algorithm is a pair (p_0, p_1) where 0 ≤ p_0 ≤ p_1 ≤ 1. These quantities can be interpreted as lower and upper bounds on the probability that the true label of the new object equals 1. The width of this interval, p_1 − p_0, can be used as a reliable measure of confidence in the prediction. Although Algorithm 2 can only be used for binary classification, it is possible to extend it to the multi-class setting [44]. This is left to future work. IVAPs are a variant of the conformal prediction algorithm where the non-conformity measure is based on an isotonic regression of the scores which the underlying scoring classifier assigns to the calibration data points as well as the new input to be classified. Isotonic (or monotonic) regression aims to fit a non-decreasing free-form line to a sequence of observations such that the line lies as close to these observations as possible. Fig. 2 shows an example of isotonic regression applied to a 2D toy data set. In the case of Algorithm 2, the isotonic regression is performed as follows. Let s_1, . . . , s_k be the scores assigned to the calibration points. First, these points are sorted in increasing order and duplicates are removed, obtaining a sequence s'_1 < · · · < s'_t. We then define the multiplicity of s'_j as

n_j = |{i : s_i = s'_j}|.

The "average label" corresponding to some score s'_j is

ȳ_j = (1/n_j) ∑_{i : s_i = s'_j} y_i.

The cumulative sum diagram (CSD) is computed as the set of points

W_0 = (0, 0),  W_j = (n_1 + · · · + n_j, n_1 ȳ_1 + · · · + n_j ȳ_j)  for j = 1, . . . , t.

For these points, the greatest convex minorant (GCM) is computed. Fig. 3 shows an example of a GCM computed for a given set of points in the plane. Formally, the GCM of a function f : U → R is the maximal convex function g defined on the convex hull of U such that g ≤ f on U [46]. It can be thought of as the "lowest part" of the convex hull of the graph of f. The value at s'_i of the isotonic regression is now defined as the slope of the GCM between W_{i−1} and W_i. That is, if f is the isotonic regression, g is the GCM and u_i denotes the first coordinate of W_i, then

f(s'_i) = (g(u_i) − g(u_{i−1})) / (u_i − u_{i−1}).

We leave the values of f at other points besides the s'_i undefined, as we will never need them here.
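The heart of the IVAP — two isotonic regressions on the calibration scores with the new point provisionally labeled 0 and then 1 — can be sketched with a small pool-adjacent-violators routine, which is equivalent to taking GCM slopes of the cumulative sum diagram. This is a simplification of Algorithm 2: it assumes all scores are distinct (skipping the duplicate handling above), and the calibration data are made up.

```python
def pava(ys):
    """Pool Adjacent Violators: isotonic (non-decreasing) least-squares
    fit to a sequence, returned as one fitted value per input point."""
    merged = []  # blocks of [mean, weight]
    for y in ys:
        merged.append([y, 1])
        # Pool adjacent blocks while they violate monotonicity.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m1, w1 = merged.pop()
            m0, w0 = merged.pop()
            merged.append([(m0 * w0 + m1 * w1) / (w0 + w1), w0 + w1])
    out = []
    for m, w in merged:
        out.extend([m] * w)
    return out

def ivap_interval(cal, s_new):
    """IVAP sketch: isotonic regression on the calibration (score, label)
    pairs plus the new score labeled 0, then labeled 1; the two fitted
    values at s_new give the interval (p0, p1)."""
    ps = []
    for label in (0, 1):
        pts = sorted(cal + [(s_new, label)])
        fit = pava([y for _, y in pts])
        ps.append(fit[[s for s, _ in pts].index(s_new)])
    return min(ps), max(ps)

cal = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (0.5, 1), (1.5, 1), (2.0, 1)]
p0, p1 = ivap_interval(cal, 1.8)  # a score deep in class-1 territory
```

For this well-separated toy calibration set, the new score 1.8 yields a fairly narrow interval near 1, reflecting a confident class-1 prediction.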

Detecting adversarial manipulation using IVAPs
IVAPs enjoy a type of validity property which we detail here. A random variable P is said to be perfectly calibrated for another random variable Y if the following equality holds almost surely:

E[Y | P] = P.   (2)

Let P_0, P_1 be random variables representing the output of an IVAP trained on a set of i.i.d. samples and evaluated on a new random sample X with label Y ∈ {0, 1}. Then there exists a random variable S ∈ {0, 1} such that P_S is perfectly calibrated for Y. Hence, for each prediction at least one of the probabilities P_0, P_1 output by an IVAP is almost surely equal to the conditional expectation of the label given the probability. Since the label is binary, the relation (2) can be rewritten as

Pr[Y = 1 | P_S] = P_S.   (3)

Eq. (3) expresses the validity property for IVAPs. If the difference p_1 − p_0 is sufficiently small for some instance x, so that p_0 ≈ p_1 ≈ p, then (3) allows us to deduce that x has label 1 with probability approximately p. The validity property (3) holds for IVAPs as long as the data are exchangeable, which is the case whenever the data are i.i.d.
Our proposed method for detecting adversarial examples is based on these observations: we use p_1 − p_0 as a confidence measure for the prediction returned by the underlying scoring classifier and then return the appropriate label based on these values. The pseudocode is shown in Algorithm 3. The general idea is to use an IVAP to obtain the probabilities p_0 and p_1 for each test instance x. If these probabilities lie close enough to each other according to the precision parameter β, then we assume to have enough confidence in our prediction; otherwise, we return a special label REJECT signifying that we do not trust the output of the classifier on this particular sample. The precision parameter β is tuned by maximizing Youden's index [48] on a held-out validation set consisting of clean and adversarial samples. In case we trust our prediction, we use p = p_1 / (1 − p_0 + p_1) as an estimate of the probability that the label is 1 (as in Vovk et al. [11]). If p > 0.5, we return 1; otherwise, we return 0. This method of quantifying uncertainty and rejecting unreliable predictions was in fact already mentioned by Vovk et al. [11], but to our knowledge nobody has yet attempted to use it for protection against adversarial examples.
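The accept/reject rule just described reduces to a few lines once the IVAP interval is available; the interval endpoints and the threshold in the example are made-up values.

```python
def detect(p0, p1, beta):
    """Decision rule sketch: reject when the IVAP interval is too wide,
    otherwise merge (p0, p1) into a point probability and classify."""
    if p1 - p0 > beta:
        return "REJECT"
    p = p1 / (1.0 - p0 + p1)  # merged probability that the label is 1
    return 1 if p > 0.5 else 0

a = detect(0.92, 0.95, beta=0.1)  # narrow interval, confident class 1
b = detect(0.30, 0.80, beta=0.1)  # wide interval: prediction rejected
```

Here the first call accepts and returns 1, while the second rejects because the interval width 0.5 exceeds β.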
An important consideration in the deployment of Algorithm 3 is the computational complexity of the approach. This is clearly determined by the underlying machine learning model, as we are required to run its learning algorithm and perform inference with the resulting model for each sample in the calibration set as well as for each new test sample. The overhead caused by the IVAP construction consists of the isotonic regressions, which are dominated by the complexity of sorting the samples in the calibration set, as well as other operations that take linear time. The scores of the calibration samples can be precomputed and stored for fast access later, reducing the complexity of the inference step. To summarize, let T_A(n) and T_F(n) denote the time complexity of running the learning algorithm and the scoring rule on a set of n samples, respectively. Then we can determine the time complexity of Algorithm 3 as follows for a bag of n samples:
• Calibration. Initially, when the IVAP is calibrated, we run the learning algorithm on the proper training set and compute scores for each of the calibration samples. This takes time O(T_A(n) + T_F(n)).
• Inference. To compute the probabilities p_0, p_1 for a new test sample, we run the scoring rule and perform two isotonic regressions. This can be done in time O(T_F(1) + n log n).
The overhead incurred by the IVAP in the calibration phase (which only needs to be performed once) is proportional to the complexity of the learning algorithm and scoring rule when executed on the given bag of samples. When performing inference, the overhead compared to simply running the underlying model is log-linear in the size of the calibration set.
It is also important to consider how the calibration set and the underlying model affect the performance of the IVAP. To the best of our knowledge, the existing literature on IVAPs does not provide quantitative answers to these questions. Vovk et al. [11] note that the validity property (3) always holds for the IVAP regardless of the performance of the underlying model, as long as the calibration set consists of independent samples from the data distribution. They also point out that, for "large" calibration sets, the difference p_1 − p_0 will be small. It would be interesting to have uniform convergence bounds that quantify how quickly p_1 − p_0 converges to zero as the calibration set grows larger, but we are not aware of any such results. Most likely, such bounds will depend on the accuracy of the underlying model, where the convergence is faster with a more accurate model.

White-box attack for IVAPs
As pointed out by Carlini and Wagner [41], it is not sufficient to demonstrate that a new defense is robust against existing attacks. To make any serious claims of adversarial robustness, we must also develop and test a white-box attack that was designed to specifically target models protected with our defense. Let x be any input that is not rejected and is classified as y by the detector. Our attack must then find x̃ ∈ X such that
1. ‖x − x̃‖_p is as small as possible;
2. x̃ is not rejected by the detector;
3. x̃ is classified as 1 − y.
Here, ‖u‖_p is the general ℓ_p norm of the vector u. Common choices for p in the literature are p ∈ {1, 2, ∞} [27]; we will focus on the ℓ_2 and ℓ_∞ norms in this work.
Following Carlini and Wagner [41], we design a differentiable function that is minimized when x̃ lies close to x, is not rejected by the detector and is classified as 1 − y. To do this efficiently, note that whenever a new sample has the same score as an old sample in the calibration set, it will have no effect on the isotonic regression and the result of applying Algorithm 3 to it will be the same as applying the algorithm to the old sample. Hence, we search for a sample (s_i, y_i) in the calibration set which is not rejected by the detector and is classified as 1 − y. From among all samples that satisfy these conditions, we choose the one which minimizes the following expression:

‖x − x_i‖_p + c · |s(x) − s_i|.

Here, s(x) is the score assigned to x by the classifier and c is a constant which trades off the relative importance of reducing the size of the perturbation vs fooling the detector. We choose these samples in this way so as to "warm-start" our attack: we seek a sample that lies as close as possible to the original x in X-space but also has a score s_i which lies as close as possible to s(x). This way, we hope to minimize the amount of effort required to optimize our adversarial example.
Having fixed such a sample (s_i, y_i), we solve the following optimization problem:

minimize_δ ‖δ‖_2^2 + c · |s(x + δ) − s_i|  subject to x + δ ∈ [0, 1]^d.   (4)

The constant c can be determined via binary search, as in the Carlini & Wagner attack [16]. In our case, we are applying the attack to a standard neural network without any non-differentiable components, meaning the score function s is differentiable. The resulting problem (4) can therefore be solved using standard gradient descent methods since the function to be minimized is differentiable as well. In particular, we use the Adam optimizer [49] to minimize our objective function. The constraint x + δ ∈ [0, 1]^d can easily be enforced by clipping the values of x + δ back into the [0, 1]^d hypercube after each iteration of gradient descent. Note that our custom white-box attack will not suffer from any gradient masking introduced by our defense. Indeed, we only rely on gradients computed from the underlying machine learning model. As long as the unprotected classifier does not mask gradients (which it should not, since the unprotected classifiers will be vanilla neural networks trained in the standard way), this information will be useful.
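A stripped-down version of this optimization can be sketched as follows. It is illustrative only: a squared score penalty replaces the absolute difference for smoothness, plain gradient descent replaces Adam, there is no binary search over c, and a toy linear model with a hand-written gradient stands in for the neural network.

```python
def attack(x, s, grad_s, s_target, c=10.0, lr=0.01, steps=800):
    """Gradient-descent sketch of the white-box attack: shrink the
    perturbation delta while driving the score of x+delta toward the
    target calibration score, clipping x+delta into [0,1]^d each step."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        xd = [xi + di for xi, di in zip(x, delta)]
        diff = s(xd) - s_target
        g = grad_s(xd)
        # Gradient of ||delta||^2 + c * (s(x+delta) - s_target)^2.
        delta = [di - lr * (2 * di + 2 * c * diff * gi)
                 for di, gi in zip(delta, g)]
        # Project back into the [0,1]^d hypercube.
        delta = [min(max(xi + di, 0.0), 1.0) - xi
                 for xi, di in zip(x, delta)]
    return delta

# Toy differentiable "score": a fixed linear model standing in for the network.
w = [1.0, -1.0]
s = lambda u: sum(wi * ui for wi, ui in zip(w, u))
grad_s = lambda u: w

x = [0.2, 0.8]                      # original input with score s(x) = -0.6
delta = attack(x, s, grad_s, s_target=0.4)
adv = [xi + di for xi, di in zip(x, delta)]
```

The optimizer balances the two terms: it moves the score of the perturbed input close to the target score 0.4 while keeping the perturbation small and inside the pixel box.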
As described, the above method of selecting a calibration sample and solving (4) only considers adversarials that are minimal in ℓ_2 distance. However, ℓ_∞-bounded perturbations are also of interest, so in our experiments we evaluate both the original ℓ_2 formulation as well as an ℓ_∞ variant where we replace ‖·‖_2^2 by ‖·‖_∞.

Experiments
To verify the robustness of Algorithm 3 against adversarial examples, we perform experiments with several machine learning classification tasks. Chief among the questions of interest here is whether the detector remains robust against adversaries that know about the defense: as argued by Carlini and Wagner [41] and Athalye et al. [32], it is not enough for an adversarial defense to be robust against an oblivious adversary; we must also establish robustness against an adversary that is aware of our defense. We therefore adapt existing adversarial attacks to target our detector and we also evaluate the IVAP against our custom white-box attack.

Fig. 4. Illustration of the different data splits. The training data is split into proper training data on which the scoring classifier is trained and calibration data on which the IVAP fits the isotonic regression. The test data is split into proper test data, which is used to determine the overall accuracy of the resulting algorithm, and validation data. Both of these subsets of test data are used to generate adversarial proper test data and adversarial validation data. The adversarial proper test data is used for evaluating the robustness of the resulting algorithm, whereas the adversarial validation data is used together with the validation data to tune β. All splits are 80% for the left branch and 20% for the right branch.

A further question is how much adversarial data is actually needed in the validation set (Section 4.4). As it stands, we augment the validation set with adversarial examples generated by a variety of attacks in order to tune the β parameter. However, we want our defense to be robust not only against attacks which it has been exposed to but also against attacks which have yet to be invented. We therefore compare our results to a setting where only one adversarial attack is used for the validation set.
The goal of these experiments is to show that the IVAP can handle both old and new adversarial examples and that its robustness generalizes beyond the limited set of attacks which it has seen. As such, we consider a number of different scenarios for each classification task, detailed below.

Robustness of the detector to the original adversarial examples
We run adversarial attack algorithms on both the validation set and the proper test set, creating an adversarial validation set and an adversarial proper test set (see Fig. 4 for details on our different data splits). The attacks we employed were projected gradient descent with random restarts [27] , DeepFool [50] , local search [51] , the single pixel attack [52] , NewtonFool [53] , fast gradient sign [6] and the momentum iterative method [54] . This means for each test set we generate seven new data sets consisting of all the adversarial examples the attacks were able to find on the test set for the original model. We then run Algorithm 3 on the adversarial validation set as well as the regular validation set. Our goal is to maximize Youden's index (defined below) on both the regular and adversarial validation sets. We choose β such that this index is maximized. We can then use the proper test set and the adversarial proper test set to judge the clean and adversarial accuracy of the resulting algorithm with the tuned value of β.
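The tuning of β can be sketched by sweeping candidate thresholds over the interval widths and maximizing Youden's index. This is a simplification of our procedure under the assumption that clean validation samples should be accepted (width ≤ β) and adversarial ones rejected (width > β); the actual index is computed from accepted and rejected correct and incorrect predictions on both validation sets, and the widths below are made up.

```python
def tune_beta(clean_widths, adv_widths):
    """Pick the interval-width threshold beta maximizing Youden's index
    J = TPR + TNR - 1, treating clean samples as should-accept and
    adversarial samples as should-reject (an illustrative simplification)."""
    best_beta, best_j = None, -1.0
    for beta in sorted(set(clean_widths + adv_widths)):
        tpr = sum(w <= beta for w in clean_widths) / len(clean_widths)
        tnr = sum(w > beta for w in adv_widths) / len(adv_widths)
        j = tpr + tnr - 1.0
        if j > best_j:
            best_j, best_beta = j, beta
    return best_beta, best_j

clean = [0.01, 0.02, 0.03, 0.05, 0.30]  # mostly narrow intervals
adv = [0.40, 0.55, 0.60, 0.25, 0.70]    # mostly wide intervals
beta, j = tune_beta(clean, adv)
```

On this toy data the sweep settles on β = 0.05, which accepts four of the five clean samples while rejecting every adversarial one.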

Robustness of the detector to new adversarial examples
We attempt to generate new adversarials which take the detector into account and evaluate the performance similarly to the first scenario. The resulting samples are called adapted adversarials . They are generated by applying all of the adversarial attacks mentioned above to Algorithm 3 . For the gradient-based attacks, we use the natural evolution strategies approach [34,55] to estimate the gradient, thereby avoiding any possible gradient masking issues [32] .

Fully white-box attack
We run our custom adversarial attack (4) for the detector on the entire proper test set and evaluate its performance. This attack is specifically designed to target the IVAP and should represent an absolute worst-case scenario for the defense.

Ablation study
We estimate β based on adversarials generated only by the DeepFool method on the validation set for the original model and then evaluate the resulting detector as before. The purpose of this scenario is to estimate how "future-proof" our method is, by evaluating its robustness against attacks that played no role in the tuning of β. Our choice for DeepFool was motivated by the observation that it is efficient and rather effective, but it is by no means the strongest attack in our collection. It is less efficient than fast gradient sign, but this method was never meant to be a serious attack; it was only meant to demonstrate excessive linearity of DNNs [6]. It is weaker than NewtonFool, for example, since this method utilizes second-order information of the model whereas DeepFool is limited to first-order approximations. It is also weaker than some first-order attacks such as the momentum iterative method, which was at the top of the NIPS 2017 adversarial attack competition [54,56]. We therefore feel that DeepFool is a good choice for evaluating how robust our defense is to future attacks.

Metrics
For all scenarios, we report the accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR) and false negative rate (FNR) of the detectors. These are defined as follows:

accuracy = (TA + TR)/m,  TPR = TA/(TA + FR),  TNR = TR/(TR + FA),
FPR = FA/(FA + TR),  FNR = FR/(FR + TA).

Here, m is the total number of samples, TA is the number of correct predictions the detector accepted, TR is the number of incorrect predictions the detector rejected, FA is the number of incorrect predictions the detector accepted and FR is the number of correct predictions the detector rejected. Youden's index is defined as

J = TPR + TNR − 1.

It is defined for every point on a ROC curve. Graphically, it corresponds to the distance between the ROC curve and the random chance line [48]. These metrics are computed for different scenarios:
• Clean. This refers to the proper test data set, without any modifications.
• Adversarial. Here, the metrics are computed on the adversarial test set. This data set is constructed by running existing adversarial attacks against the underlying neural network on the proper test set. They do not take the IVAP into account.
• Adapted. This is similar to the Adversarial scenario but the attacks are modified to take the IVAP into account.
• Custom ℓ_p. Here, we compute the metrics for adversarial examples generated using our custom ℓ_p white-box attack on the proper test set. We report results for p = 2 and p = ∞.
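Given the four counts, all of these metrics follow directly; the counts in the example are made up for illustration.

```python
def detector_metrics(ta, tr, fa, fr):
    """Evaluation metrics from the four detector counts: TA/TR are
    accepted-correct and rejected-incorrect predictions, FA/FR are
    accepted-incorrect and rejected-correct predictions."""
    m = ta + tr + fa + fr
    tpr = ta / (ta + fr)  # correct predictions that were accepted
    tnr = tr / (tr + fa)  # incorrect predictions that were rejected
    return {
        "accuracy": (ta + tr) / m,
        "tpr": tpr,
        "tnr": tnr,
        "fpr": fa / (fa + tr),
        "fnr": fr / (fr + ta),
        "youden": tpr + tnr - 1.0,
    }

metrics = detector_metrics(ta=80, tr=10, fa=5, fr=5)
```

With these counts the detector accepts 80 correct and rejects 10 incorrect predictions out of 100, giving an accuracy of 0.9.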

Implementation details
The implementations of the adversarial attacks, including the gradient estimator using the natural evolution strategies, were provided by the Foolbox library [57]. The implementation of the Venn-ABERS predictor was provided by Toccaceli [58]. The different data splits we perform for these experiments are illustrated schematically in Fig. 4. Almost all neural networks were trained for 50 epochs using the Adam optimizer [49] with default parameters in the Keras framework [59] using the TensorFlow backend [60]. The exception is the CNN for the cats vs dogs task, which was trained for 100 epochs. No regularization or data augmentation schemes were used; the only preprocessing we perform is a normalization of the input pixels to the [0,1] range as well as resizing the images in one of the tasks. Descriptions of all the CNN architectures can be found in Appendix A.

Data sets
Here, we detail the different data sets used for our experiments. Note that these are all binary classification problems, since the IVAP construction as formulated by Vovk et al. [11] only works for binary tasks. Extending the IVAP to multiclass problems has been discussed, for example, in Manokhin [44] ; using a multiclass IVAP in adversarial defense will be explored in future work.
Zeroes vs ones. Our first task is a binary classification problem based on the MNIST data set [61] . We take the original MNIST data set and filter out only the images displaying either a zero or a one. The pixel values are normalized to lie in the interval [0,1]. We then run our experiments as described above, using a convolutional neural network (CNN) as the scoring classifier for Algorithm 3 .
Cats vs dogs. The second task we consider is the classification of cats and dogs in the Asirra data set [62]. This data set consists of 25,000 JPEG color images, where half contain dogs and half contain cats. Again we train a CNN as the scoring classifier for Algorithm 3. In the original collection, not all images were of the same size. In our experiments, we resized all images to 64 × 64 pixels and normalized the pixel values to [0,1] to facilitate processing by our machine learning pipeline.
T-shirts vs trousers. The third task is classifying whether a 28 × 28 grayscale image contains a picture of a T-shirt or a pair of trousers. Similarly to the MNIST zeroes vs ones task, we take the Fashion-MNIST data set [63] and filter out the pictures of T-shirts and trousers. Again, the pixel values are normalized to [0,1] and a CNN is used for this classification problem.
Airplanes vs automobiles. Our fourth task is based on the CIFAR-10 data set [30], which consists of 60,000 RGB images of 32 × 32 pixels. We filter out the images of airplanes and automobiles and train a CNN to distinguish these two classes.
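All four tasks follow the same recipe: keep the samples of two classes, relabel them 0/1 and rescale the pixels to [0,1]. A minimal NumPy sketch of this filtering step (the actual pipeline is assumed to load the data via Keras; function and variable names are our own):

```python
import numpy as np

def binary_subset(x, y, class_a, class_b):
    """Keep only the samples of two classes and relabel them 0/1."""
    mask = (y == class_a) | (y == class_b)
    x, y = x[mask], y[mask]
    y = (y == class_b).astype(np.int64)      # class_a -> 0, class_b -> 1
    x = x.astype(np.float32) / 255.0          # normalize pixels to [0, 1]
    return x, y
```

For MNIST zeroes vs ones, for example, one would call `binary_subset(x_train, y_train, 0, 1)` on the arrays returned by the Keras data loader.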

Results and discussion
The baseline accuracies of the unprotected CNNs are given in Table 1 for the different tasks. We can see that they perform reasonably accurately on clean test data, but as expected the adversarial accuracy is very low. Table 2 gives the clean accuracies of the IVAPs for each of the tasks. From these results, we can see that the clean accuracy of the protected model invariably suffers from the addition of the IVAP. However, in almost all tasks, the false positive and false negative rates are relatively low. Our detectors are therefore capable of accepting correct predictions and rejecting mistaken ones from the underlying model on clean data. The exception is the cats vs dogs task, where the false positive rate is unusually high at 51.15%. However, we believe these results might be improved by using a better underlying CNN. In our initial experiments, we used the same CNN for this task as we did for airplanes vs automobiles and trained it for only 50 epochs. Using a more complex CNN and training it longer significantly improved the results, so we are optimistic that the results we obtained can be improved even further by adapting the underlying CNN and training regime.
The results presented here are observations from a single run of our experiments. Multiple runs of the experiments did not produce significantly different results.
Robustness to existing attacks. The first question we would like to answer when evaluating any novel adversarial defense is whether it can defend against existing attacks that were not adapted specifically to fool it. This is the bare minimum of strength one desires: if existing attacks can already break the defense, then it is useless. In Table 2, we evaluate our IVAP defense against the suite of attacks described in Section 4.1. The results are shown for the Adversarial and Adapted scenarios. When we transfer adversarials generated for the underlying model to the IVAP, the accuracy always degrades significantly. However, it is important to note that all unprotected CNNs have very low accuracy on these samples. As such, in almost all cases, the IVAP has a very high true negative rate, as most predictions are in error. The exception is the cats vs dogs task, where the true negative rate is rather low on the transferred adversarials. The accuracy is still much higher than that of the unprotected model, though (52.56% vs 1.8%), so the IVAP did significantly improve the robustness against this transfer attack. The adversarials generated using existing attacks adapted to the presence of the IVAP also degrade the accuracy, but not as severely as the transfer attack. True negative rates remain high across all tasks and the level of accuracy is, at the very least, much better than without IVAP protection.
Robustness to novel attacks. To more thoroughly evaluate the strength of our defense, it is necessary to also test it against an adaptive adversary using an attack that is specifically designed to circumvent our defense. For this reason, Table 2 also shows two rows of test results for each task, where we run our custom white-box attack for both the ℓ2 and ℓ∞ norms. At first glance, it appears as though our attack is completely successful and fully breaks our defense, yielding 0% accuracy. To investigate this further, Figs. 6 and 7 show random selections of five adversarial examples generated by our white-box attack that managed to fool the different detectors. Empirical cumulative distributions of the distortion levels introduced by our white-box attack are shown in Figs. 10 and 11. Visually inspecting the generated adversarials and looking at the distributions of the distortion levels reveals cause for optimism: although some perturbations are still visually imperceptible, there is a significant increase in the overall distortion required to fool the detectors. This is especially the case for the ℓ∞ adversarials, where almost all perturbations are obvious. Even the ℓ2 adversarials are generally very unsubtle, except for the ones generated on the cats vs dogs task. We believe this is because the Asirra images are larger than the CIFAR-10 images (64 × 64 vs 32 × 32), so the ℓ2 perturbations can be more "spread out" across the pixels. Regardless, if one compares the adversarial images to the originals, the perturbations are still obvious because the adversarials are unusually grainy and blurred.
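The distortion statistics underlying such plots are straightforward to compute; a sketch (our own illustration, not the paper's code) of the per-sample ℓ2/ℓ∞ distortions and their empirical cumulative distribution:

```python
import numpy as np

def distortions(originals, adversarials, p=2):
    """Per-sample l2 or l-infinity distortion between originals and adversarials."""
    diffs = (adversarials - originals).reshape(len(originals), -1)
    if p == 2:
        return np.linalg.norm(diffs, axis=1)  # l2 norm per sample
    return np.abs(diffs).max(axis=1)          # l-infinity norm per sample

def empirical_cdf(values):
    """Sorted distortion levels and their cumulative fractions."""
    xs = np.sort(values)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys
```

Plotting `xs` against `ys` gives the kind of empirical cumulative distribution curve discussed here: the further right the curve sits, the more distortion an attack needs.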

Sensitivity of hyperparameters.
For the experiments carried out in Table 2, the IVAP was only exposed to adversarial attacks which it had already seen: the rejection threshold β was tuned on samples perturbed using the same adversarial attacks as it was then tested against. Fig. 5 shows the ROC curves for the different detectors along with the optimal thresholds. An important question that arises now is how well the IVAP fares if we test it against attacks that differ from the ones used to tune its hyperparameters, since in practice it is doubtful that we will be able to anticipate which attacks an adversary may use. Moreover, even if we can anticipate this, the set of possible attacks may be so large as to make it computationally infeasible to use them all. Therefore, we also test the performance of the IVAP when β is tuned on adversarials generated only by the DeepFool attack instead of the full suite of attacks. We obtained identical values of β for the zeroes vs ones and cats vs dogs tasks. The results for the other tasks, where β was different, are shown in Table 3. The conclusions are similar: the accuracy degrades both on clean and adversarial data, but clean accuracy is still high and true negative rates are high on adversarial data. The custom white-box attack again appears highly successful. Figs. 12 and 13 plot the empirical cumulative distributions of the adversarial distortion levels for the ablated detectors, and Figs. 8 and 9 show selections of custom white-box adversarials which fooled the ablated detectors. These results show that robustness to our ℓ2 white-box attack is significantly lower compared to the non-ablated detectors. The ℓ∞-bounded perturbations are still very noticeable, however. As an illustrative example, Fig. 14 shows histograms of the differences p1 − p0 between the upper and lower probabilities for clean and DeepFool adversarial examples on the zeroes vs ones task.
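Tuning β to maximize Youden's index can be done with a simple sweep over candidate thresholds. The following is a simplified sketch (names and the brute-force sweep are our own, not the paper's implementation) in which clean tuning samples play the role of positives and adversarial ones of negatives:

```python
import numpy as np

def youden_threshold(widths_clean, widths_adv):
    """Pick the rejection threshold beta maximizing Youden's index J.

    widths_clean / widths_adv: interval widths p1 - p0 observed on clean
    and adversarial tuning samples. A sample is rejected when its width
    exceeds beta.
    """
    candidates = np.unique(np.concatenate([widths_clean, widths_adv]))
    best_beta, best_j = None, -np.inf
    for beta in candidates:
        tpr = np.mean(widths_clean <= beta)   # clean samples accepted
        tnr = np.mean(widths_adv > beta)      # adversarials rejected
        j = tpr + tnr - 1                     # Youden's index at this point
        if j > best_j:
            best_beta, best_j = beta, j
    return best_beta, best_j
```

This is the brute-force equivalent of picking the ROC point farthest above the random-chance line, as in Fig. 5.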
Although there is a small amount of overlap, the vast majority of adversarial examples have a much higher difference than clean samples: clean data has a difference close to 0%, whereas adversarials are almost always close to 50%. The tuned model threshold is placed so that the majority of clean samples are still allowed through but virtually all adversarials are rejected.
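The rejection rule itself is a thin wrapper around the IVAP's interval. A minimal sketch, assuming the commonly used minimax merge p = p1/(1 − p0 + p1) from the Venn-ABERS literature (the function name is our own):

```python
def ivap_predict(p0, p1, beta):
    """IVAP-protected prediction: reject when the uncertainty interval
    [p0, p1] is wider than the tuned threshold beta.

    Returns 0 or 1 for an accepted prediction, or None for a rejection.
    """
    if p1 - p0 > beta:
        return None                      # interval too wide: reject
    p = p1 / (1.0 - p0 + p1)             # merge the interval into one probability
    return int(p >= 0.5)
```

With the gaps observed above (clean widths near 0%, adversarial widths near 50%), a β between the two modes accepts almost all clean samples while rejecting virtually all adversarials.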

Comparison with other methods
In this section, we compare our IVAP method to the Madry defense [27] which, to the best of our knowledge, is the strongest state-of-the-art defense at the time of this writing. Specifically, we apply the Madry defense to each of the CNNs used in the tasks from Section 4 and compare the accuracies of the different methods. We also perform transfer attacks where we try to fool our IVAP using adversarials generated for the Madry defense and vice versa.
The Madry defense is a form of adversarial training where, at each iteration of the learning algorithm, the ℓ∞ projected gradient descent (PGD) attack is used to perturb all samples in the current minibatch. The model is then updated based on this perturbed batch instead of the original. To facilitate a fair comparison, we do not use the pre-trained models and weights published by Madry et al. for the MNIST and CIFAR-10 data sets, since these are meant for the full 10-class problems whereas our defense only considers a simpler binary subproblem. Rather, we modified the network architectures for these tasks so they correspond to our own CNNs from Section 4 and trained them according to the procedure outlined by Madry et al. [27]. Table 4 summarizes the parameter settings we used for the PGD attack and the results we obtained with the Madry defense; step sizes were always set to 0.01 and a random start was used each time. The defense performs very well on the zeroes vs ones and T-shirts vs trousers tasks. However, on airplanes vs automobiles the adversarial accuracy is low and on the cats vs dogs task both the clean and adversarial accuracies are low. In fact, for the latter task, the adversarial accuracy is almost the same as if no defense was applied at all.
In Table 6 we transfer the white-box adversarials for our defense to the Madry defense, testing adversarials generated by both the ℓ2 and ℓ∞ variants of our custom attack. The Madry defense appears quite robust to our IVAP adversarials for T-shirts vs trousers and airplanes vs automobiles, but much less so on zeroes vs ones. The accuracy on the adversarials for the cats vs dogs task is close to the clean accuracy, so these appear to have little effect on the Madry defense. Table 5 shows what happens when we transfer adversarials for the Madry defense to our IVAP detectors. We observe that for zeroes vs ones and airplanes vs automobiles, the accuracy drops below 50%. However, false positive rates on these adversarials are very low, so the detectors successfully reject many incorrect predictions. On the other tasks, accuracy remains relatively high. The false positive rate is rather high for the cats vs dogs task, but low for T-shirts vs trousers. Fig. 15 plots the empirical distributions of the distortion levels of the Madry adversarials, for both the ℓ2 and ℓ∞ distances. Comparing this figure with Figs. 10 and 11, we see that our custom white-box adversarials for the IVAP defense generally require higher ℓ2 and ℓ∞ levels of distortion than the Madry adversarials. Fig. 16 plots selections of adversarials generated for the Madry defenses by the ℓ∞ PGD attack. The visual levels of distortion appear comparable to our defense (for the ℓ∞ variant of our attack), except in the case of the cats vs dogs task, where the perturbations for the Madry defense seem much larger. It is important to note, however, that the performance of the Madry defense on this task is close to random guessing, whereas the IVAP still obtains over 70% clean accuracy.
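The ℓ∞ PGD inner loop at the heart of the Madry defense can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation; `grad_fn` stands in for the gradient of the training loss with respect to the input:

```python
import numpy as np

def pgd_linf(grad_fn, x, eps=0.3, alpha=0.01, steps=40, rng=None):
    """l-infinity PGD with a random start, as used inside adversarial training.

    grad_fn(x_adv) must return the gradient of the loss w.r.t. the input;
    eps is the l-infinity budget and alpha the step size.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)      # random start
    x_adv = np.clip(x_adv, 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))   # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)          # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # keep valid pixel range
    return x_adv
```

During adversarial training, each minibatch is replaced by the output of such a loop before the weight update, which is why the procedure becomes so expensive for large models and inputs.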
We are led to the conclusion that our defense in fact achieves higher robustness on these tasks than the Madry defense. Moreover, our defense appears to scale much better: adversarially training a deep CNN with ℓ∞ projected gradient descent quickly becomes computationally intractable as the data increase in dimensionality and the CNN increases in complexity. On the other hand, our IVAP defense can take an existing CNN (trained using efficient, standard methods) and quickly augment it with uncertainty estimates based on a calibration set and a data set of adversarials for the underlying model. The resulting algorithm appears to require higher levels of distortion than the Madry defense in order to be fooled effectively, and it seems to achieve higher clean and adversarial accuracy on more realistic tasks. Finally, we note that our method introduces only a single extra scalar parameter, the threshold β, which can easily be tuned manually if desired. It is also possible to tune β based on objectives other than Youden's index, which was used here. On the other hand, the Madry defense requires a potentially lengthy adversarial training procedure whenever the defense's parameters are modified.

Conclusion
We have proposed using inductive Venn-ABERS predictors (IVAPs) to protect machine learning models against adversarial manipulation of input data. Our defense uses the width of the uncertainty interval produced by the IVAP as a measure of confidence in the prediction of the underlying model; the prediction is rejected in case this interval is too wide. The acceptable width is a hyperparameter of the algorithm which can be estimated using a validation set. The resulting algorithm is no longer vulnerable to the original adversarial examples that fooled the unprotected model, and attacks which take the detector into account have very limited success. We do not believe this success is due to gradient masking, as the defense remains robust even when subjected to attacks that do not suffer from this phenomenon. Our method appears to be competitive with the defense proposed by Madry et al. [27], which is state-of-the-art at the time of this writing.
The algorithm proposed here is limited to binary classification problems. However, there exist methods to extend the IVAP to the multiclass setting [44] . Our future work will focus on using these techniques to generalize our detector to multiclass classification problems.
Of course, the mere fact that our own white-box attack fails against the IVAP defense does not constitute hard proof that our method is successful. We therefore invite the community to scrutinize our defense and to attempt to develop stronger attacks against it. Our code is available at https://github.com/saeyslab/binary-ivap . The performance of the IVAP defense is not yet ideal, since clean accuracy is noticeably reduced. However, we believe these findings represent a significant step forward. As such, we suggest that the community further investigate methods from the field of conformal prediction in order to achieve adversarial robustness at scale. To our knowledge, we are the first to apply these methods to this problem, although the idea has been mentioned elsewhere before [64, Section 9.3, p. 133].

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.