Conformal Prediction with Orange

Conformal predictors estimate the reliability of predictions made by supervised machine learning models. Instead of a point value, conformal prediction defines an outcome region that meets a user-specified reliability threshold. Provided that the data are independently and identically distributed, the user can control the level of prediction error and adjust it following the requirements of a given application. The quality of conformal predictions often depends on the choice of the nonconformity estimate for a given machine learning method. To promote the selection of a successful approach, we have developed Orange3-Conformal, a Python library that provides a range of conformal prediction methods for classification and regression. The library also implements several nonconformity scores. It has a modular design and can be extended to add new conformal prediction methods and nonconformities.


Introduction
The application of supervised machine learning tools is growing, and an increasing number of decisions in a wide range of industries are based on computer-generated predictions. The effects of prediction errors vary between different applications. For example, forecasts of recommender systems that aim to increase sales in e-commerce are less critical than those of diagnostic systems in the intensive care unit. In applications where the reliability of individual predictions is crucial, quantification of the prediction confidence should complement the output of a machine learning method.
A range of current prediction confidence or reliability methods employ sensitivity analysis, infer reliability from the predictions of near neighbors, or exploit properties of individual machine learning algorithms, such as tree variance in a random forest (Bosnić and Kononenko 2009). Conformal predictions (Vovk et al. 2005) are conceptually different from other reliability methods in that they predict a range of values rather than reporting a confidence value. Provided that the data are independent and identically distributed (IID), the conformal prediction framework assures that the true value lies within the predicted range at a given significance level. Consequently, the user can control the level of prediction error and adjust it following the requirements of a given application.
The inputs to conformal prediction are a set of labeled examples and a nonconformity function that measures the strangeness of a labeled example. The output is a model that, for a new example, predicts a set of labels with low nonconformity. The nonconformity threshold is specified by the user through the desired significance level. For example, at significance level 0.05, a conformal predictor produces a prediction region that contains the true label with a probability of at least 95%. Provided IID data sets, conformal prediction methods guarantee the rate of correct predictions at the given significance level. The approach was introduced by Vovk et al. (2005) and presented in a tutorial published in the Journal of Machine Learning Research (Shafer and Vovk 2008). It has found use in practical fields such as chemoinformatics (Eklund et al. 2012) and molecular biology (Cortés-Ciriano et al. 2015).
The quality of conformal prediction depends on the choice of the nonconformity function. For example, in classification (Shafer and Vovk 2008), nonconformity can be estimated by the predicted probability of the correct class, the difference between the predicted probabilities of the two most likely classes, or the distance to the closest neighbors with the same class label. In regression, a common nonconformity measure is the absolute error of a point prediction. A nonconformity threshold for a given significance level is estimated from the distribution of nonconformity scores. A variety of sampling approaches, such as inductive, transductive and cross-validation methods, have been proposed to estimate this distribution (Vovk 2015).
We present here Orange3-Conformal, a Python-based framework (van Rossum et al. 2011) for conformal prediction that supports various combinations of conformal prediction approaches and nonconformity scores. The library is an add-on to the Orange data mining framework (Demšar et al. 2013) and can also accommodate data and predictors from scikit-learn (Pedregosa et al. 2011). Several implemented approaches are specifically tailored to individual nonconformity scores, while others are general and can be used with any classifier or regressor. The library is designed to support applications of conformal prediction and facilitates research in novel calibration methods and nonconformities.

Conformal prediction
In this section we briefly introduce the theory of conformal predictions and describe the sampling methods and nonconformity measures that are implemented in the Orange3-Conformal library.

Conformal prediction theory
Conformal prediction theory is based on a nonconformity measure, a real-valued function F (B, z) that measures the difference between an example z and the examples in the bag B. An example z = (x, y) consists of a feature vector x and an actual or assumed response y. For regression the response is a scalar, while for classification y can take any of the values in the set of n labels, y ∈ {l 1 , . . . , l n }. For classification, the nonconformity measure is computed for each possible class label; in the following, the nonconformity measure for an example z with class value y is denoted by F (B, z) y .
Evaluating the nonconformity measure for a given example and response yields a scalar nonconformity score (NCS). The p value over a set of NCSs estimates the probability that a given label y for an example z is accurate. By the law of large numbers, this probability is correct under the assumption of exchangeability (Vovk et al. 2005). When applying a conformal predictor, a significance level ε is specified, yielding a prediction set Γ that includes the true label with probability 1 − ε.
For a binary classifier, the prediction set can include one label, both labels or no label, while regression conformal prediction defines the highest and lowest value of the response by identifying the y values that yield an NCS with a rank corresponding to the significance level ε. Hence, for a set of m NCSs, {α 1 , . . . , α m }, the p value for label y corresponds to the fraction of NCSs greater than or equal to the NCS of the example being predicted (α ∗ ):

p value y = |{i : α i ≥ α ∗ }| / m. (1)

The label y is included in the prediction set Γ if and only if its p value is greater than the significance level (p value y > ε; Shafer and Vovk 2008). This means that there is a significant share of "stranger" examples, so this label is not unusual and should be included.
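The p-value computation and the construction of the prediction set can be sketched in a few lines of plain Python (a minimal illustration of the definitions above, not the library's implementation):

```python
def p_value(calibration_scores, alpha_star):
    """Fraction of calibration NCSs at least as strange as the new example's NCS."""
    m = len(calibration_scores)
    return sum(1 for a in calibration_scores if a >= alpha_star) / m

def prediction_set(p_values_by_label, eps):
    """Keep every label whose p-value exceeds the significance level eps."""
    return {y for y, p in p_values_by_label.items() if p > eps}

# Hypothetical calibration NCSs and candidate labels for a new example
scores = [0.1, 0.2, 0.35, 0.5, 0.8, 0.9]
ps = {"a": p_value(scores, 0.15), "b": p_value(scores, 0.85)}
print(prediction_set(ps, 0.2))  # only "a" survives: {'a'}
```

A label with a low NCS relative to the calibration set receives a high p-value and stays in the prediction set; raising eps shrinks the set.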
The quality of a conformal predictor is quantified by its validity and efficiency. A conformal predictor is valid if the true label is within the prediction range for a fraction of at least 1 − ε of the examples in the validation set. For regression, the efficiency is described by the tightness of the prediction ranges: the tighter the interval, the more efficient the conformal predictor. For classification, an individual prediction set can be empty, consist of a single label (a singleton) or contain multiple labels. Ideally, most of the predictions are singletons; the fraction of singletons is therefore a feasible efficiency measure. Too low a significance level or too hard a prediction problem leads to more multiple predictions. Conversely, too high a significance level, which tolerates a low accuracy of the conformal predictor, can lead to empty predictions.

Sampling methods
NCSs are computed on a set of examples that are sampled by one of three conceptually different approaches, denoted as transductive, inductive and cross conformal prediction. A transductive conformal predictor recalculates NCSs for all examples upon every prediction. Conversely, an inductive conformal predictor reuses a static set of NCSs computed on a so-called calibration set, left outside of the training set. In cross conformal prediction, cross-validation sampling is used to select multiple calibration sets, thereby increasing the number of available NCSs in comparison with the inductive approach. We provide some guidelines on the choice of the sampling method in Section 3.5.

Nonconformity measures
Inverse probability (InverseProbability) uses an underlying classification model and the probability (p y ) the model assigns to the class label (y). The nonconformity score for label y is defined as F (B, z) y = 1 − p y .
Probability margin (ProbabilityMargin) is similar to InverseProbability but uses a normalized difference between the probability of the label (p y ) and the maximal probability of any other class.
SVM distance (SVMDistance) uses the distance in the space of the SVM model from the predicted example to the SVM's decision boundary.
KNN distance (KNNDistance) is only defined for binary classification problems and is expressed as the ratio between the sum of distances to the set of k nearest neighbors that belong to the same class as the example z (N k (z) y ) and the sum of distances to the set of neighbors in the opposite class (N k (z) label≠y ).
KNN fraction (KNNFraction) identifies k nearest neighbors of the target data instance for which the NCS is calculated (N k (z)) and is expressed as a fraction of k nearest neighbors that share the class label y with the target data instance. Alternatively, the neighbors can be weighted by their distances to the target data instance.
Leave-one-out classification (LOOClassNC) measures the predictability of an instance from a set of its nearest neighbors. The value is normalized by the predictability of nearest neighbors. This is an experimental nonconformity score based on the study by Toplak et al. (2014).
Absolute error (AbsError) for regression models uses the difference between the predicted value ŷ and the response y of the example for which the NCS is calculated: F (B, z) y = |ŷ − y|.
Absolute error RF (AbsErrorRF) uses the standard deviation of the predictions of all constituent trees, σ RF , of an underlying random forest model. The standard deviation, together with a scaling factor β, is used to normalize the absolute prediction error.
Error model (ErrorModelNC) is based on two underlying regressors. The first one is trained to predict the value of the example for which the NCS is calculated, while the second is used for predicting logarithms of the errors made by the first model, as described by Papadopoulos and Haralambous (2011).
Absolute error normalized (AbsErrorNormalized) uses an underlying regression model to predict the response of the target example; the error is then normalized by distances and variances of the nearest neighbors.
Leave-one-out regression (LOORegrNC) is analogous to the classification version. However, instead of using the predicted probability of the correct class, it relies on the absolute error of the prediction.
Absolute error KNN (AbsErrorKNN) computes the difference between the average response of the k nearest neighbors, ȳ, and the response of the example for which the NCS is calculated: F (B, z) y = |ȳ − y|. It is implemented with an average- and/or variance-normalization option based on the nearest neighbors.
Average error KNN (AvgErrorKNN) calculates the average absolute difference in responses between the example for which the NCS is calculated and its k nearest neighbors: F (B, z) y = (1/k) ∑ z i ∈ N k (z) |y i − y|, where y i denotes the response of neighbor z i .
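To make the definitions concrete, the simplest of the scores above can be written as standalone functions (illustrative re-implementations under our own naming, not the library's classes):

```python
import numpy as np

def inverse_probability(prob_y):
    """Classification: one minus the model's probability for the assumed label y."""
    return 1.0 - prob_y

def abs_error(y_pred, y_true):
    """Regression: absolute error of the point prediction."""
    return abs(y_pred - y_true)

def avg_error_knn(neighbor_responses, y):
    """Average absolute difference between the example's response and its neighbors'."""
    return float(np.mean(np.abs(np.asarray(neighbor_responses) - y)))

print(round(inverse_probability(0.8), 2))       # 0.2: low score for a confident label
print(avg_error_knn([10.0, 12.0, 14.0], 12.0))  # (2 + 0 + 2) / 3
```

Higher scores mark stranger examples; any such function can be plugged into the p-value machinery of the previous section.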

The Orange3-Conformal package
We propose a Python-based implementation of conformal prediction. In the following sections, we compare it with related software packages, describe the architecture of Orange3-Conformal and provide an example of its use.

Related work
Several modular and extensible frameworks for conformal prediction have been proposed; they vary both in the breadth of implemented functionality and in development activity. The Python library PyCP (Balasubramanian et al. 2013) implemented core techniques for classification, but the package has since become unavailable. More functionality is provided in the R (R Core Team 2021) package conformal (Cortes 2016), which covers both classification and regression and was published as a side product of an application in proteomics (Cortés-Ciriano et al. 2015). The most extensive of the available libraries is the Python library nonconformist (Linusson 2018), which implements a range of conformal prediction approaches but couples them with few nonconformity scores. Table 2 provides a brief comparison of these three libraries. Orange3-Conformal is conceptually most similar to nonconformist, but includes a richer set of parameterized nonconformity functions.

Table 2: Comparison of functionalities implemented in existing conformal prediction libraries. The first column for each library corresponds to classification and the second to regression. For example, PyCP implements a transductive approach for classification, but not for regression. In rows, we report the number of nonconformity scores that are available. In the second section, we indicate the availability of different approaches for the selection of calibration sets for the computation of nonconformity scores (see Section 2.2). The last section provides information on documentation and availability of the software package. Notice that PyCP is no longer available. A mondrian setting is not applicable to regression and hence the corresponding entries are left blank.

Nonconformity functions should discriminate well between conforming and nonconforming data instances for a given outcome label (Shafer and Vovk 2008) and are a critical component of conformal prediction. Their quality dictates the narrowness of the prediction range and with that the utility of the approach. Orange3-Conformal aims to cover a wide range of nonconformity approaches and to provide their implementation for further research on their differences and utility.
In most cases, the frameworks implement conformal prediction methods as an additional layer over underlying classification or regression models. With clearly defined approaches to nonconformity scoring, the main distinguishing factor of a specific implementation is its usability. The nature of the overhead introduced by the conformal prediction framework leaves very little room for speed improvements, which is why comparing the packages by response time would rank them according to the speed of the employed machine learning package.

Design
Orange3-Conformal wraps learners and classifiers from the Orange (Demšar et al. 2013) data mining suite. Besides being a pure Python-based library, Orange is essentially a visual programming toolbox that features a graphical user interface for the construction of interactive data analysis workflows (Curk et al. 2004). Orange3-Conformal is thus also a stepping stone towards interactive visualizations and exploratory analysis of conformal predictions.
Conformal predictions in Orange3-Conformal are obtained by selecting a combination of two main components:

Conformal prediction method, which can be transductive (TransductiveClassifier), inductive (InductiveClassifier and InductiveRegressor) or cross (CrossClassifier and CrossRegressor).

Nonconformity measure, which is one of many provided measures of how unusual a specific data instance is. Orange3-Conformal includes general-purpose nonconformity measures, like InverseProbability and ProbabilityMargin for classification, and AbsError and AbsErrorNormalized for regression. These measures work in combination with any underlying prediction model. There is also a set of measures that are predictor-specific, including, among others, SVMDistance, KNNFraction, AbsErrorRF, and AvgErrorKNN.
The library is modular, with a range of implemented nonconformity measures that can also serve as examples for those who wish to experiment with and research other possible nonconformities. There are two core modules, which implement methods for conformal classification (conformal.classification) and regression (conformal.regression). A separate module implements the nonconformity scores (conformal.nonconformity). Several evaluation scores have been proposed to estimate the performance of conformal predictors (Lofstrom et al. 2013); for classification, these also include confidence, credibility, and confusion. The module conformal.evaluation includes implementations of scoring functions for regression and classification, together with sampling-based evaluation procedures. The library is accompanied by a tutorial and reference documentation (http://orange3-conformal.readthedocs.io/).
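To make the division of roles between the two components concrete, the inductive scheme with an inverse-probability score can be sketched from scratch (a self-contained toy illustration; the names ToyInductiveClassifier and centroid_model are our own and independent of the library's classes):

```python
import numpy as np

def centroid_model(X, y):
    """Hypothetical toy classifier: inverse-distance 'probabilities' from class centroids."""
    cents = {c: np.mean([xi for xi, yi in zip(X, y) if yi == c], axis=0)
             for c in set(y)}
    def proba(x):
        inv = {c: 1.0 / (1e-9 + np.linalg.norm(np.asarray(x) - cents[c]))
               for c in cents}
        s = sum(inv.values())
        return {c: v / s for c, v in inv.items()}
    return proba

class ToyInductiveClassifier:
    """Inductive conformal classifier with an inverse-probability nonconformity."""
    def __init__(self, model_factory, labels):
        self.model_factory, self.labels = model_factory, labels

    def fit(self, train_X, train_y, cal_X, cal_y):
        self.proba = self.model_factory(train_X, train_y)
        # Static calibration NCSs: 1 - probability of the true label
        self.cal_scores = [1 - self.proba(x)[t] for x, t in zip(cal_X, cal_y)]

    def predict(self, x, eps):
        m = len(self.cal_scores)
        region = set()
        for y in self.labels:
            alpha = 1 - self.proba(x)[y]          # NCS of the assumed label
            if sum(a >= alpha for a in self.cal_scores) / m > eps:
                region.add(y)
        return region

# Two well-separated 1-D classes; calibration set held out from training
train_X, train_y = [[0], [1], [10], [11]], [0, 0, 1, 1]
cal_X, cal_y = [[0.2], [0.8], [9.8], [10.6]], [0, 0, 1, 1]
cp_toy = ToyInductiveClassifier(centroid_model, labels=[0, 1])
cp_toy.fit(train_X, train_y, cal_X, cal_y)
print(cp_toy.predict([0.5], eps=0.1))  # singleton prediction: {0}
```

Swapping centroid_model for any other probabilistic model, or the score for another nonconformity, changes neither the calibration nor the prediction logic, which is the modularity the library's design exploits.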

Installation
Orange3-Conformal is an add-on for the Orange machine learning and data mining suite in Python and requires an installation of Orange:

• Use the installer from the official web page (http://orange.biolab.si/).
• With an existing Python installation and a C/C++ compiler (Windows users can get one at http://landinghub.visualstudio.com/visual-cpp-build-tools), use the pip package manager to install Orange3 from PyPI (Python Package Index):

pip install Orange3

• Get the latest development version from the source code repository (https://github.com/biolab/orange3) and follow the instructions there.

After installation, try to import the Orange library in Python:

>>> import Orange

To install Orange3-Conformal, use the pip that belongs to the Python instance used to run Orange:

pip install Orange3-Conformal

Finally, check that the installation was successful:

>>> import orangecontrib.conformal as cp
>>> print(cp.__version__)

We will also set a random seed to make the demonstration replicable:

>>> import numpy as np
>>> np.random.seed(12345)

Classification
Here we demonstrate how to use an inductive conformal classifier with the nonconformity score computed as the inverse probability of logistic regression. We illustrate its use on the Iris data set from Anderson (1936). This classic data set describes 150 iris flowers with four features (sepal length, sepal width, petal length, petal width) and their species (Iris Setosa, Iris Versicolor, Iris Virginica), which is the class variable.
First, we need to import the libraries:

>>> import Orange
>>> import orangecontrib.conformal as cp

Next, we load the data set, split it into a train, calibration and test set, and extract a single data instance for which we will make a conformal prediction:

>>> iris = Orange.data.Table("iris")

We can see that at the 0.1 significance level only the correct class of Iris-setosa was predicted (singleton prediction). By changing the significance level to 0.02, which implies a lower tolerance for errors, the model loosens its prediction to a multiple prediction with three possible classes: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 16 multiple predictions at the 0.05 significance level in the test set of 75 instances. All these cases belong to the boundary region between the Iris-virginica and Iris-versicolor classes, while the instances from the Iris-setosa class are all assigned a correct singleton prediction. See Figure 1 for a visualization of predictions made by the conformal predictor on the test set. As expected, the mistakes and non-singleton predictions appear on the boundary between different class values.

The list multi contains the necessary data about the image, predicted classes, and the true class from which we can visualize these uncertain examples as in Figure 3. By visual inspection of these examples, it is easy to agree with the uncertainty of the conformal predictions as all the proposed classes look likely. Interestingly, looking at the images from farther away, humans can determine the right digit in most of these cases.

Regression
Here we demonstrate the use of Orange3-Conformal on the regression problem of Boston house prices (https://archive.ics.uci.edu/ml/machine-learning-databases/housing/, Harrison and Rubinfeld 1978) and show the interaction with the scikit-learn library. The Boston house prices data set consists of 506 instances of Boston towns or suburbs each described by 13 features such as the crime rate per capita, average number of rooms per dwelling and pupil-teacher ratio by town. The target variable is the median value of owner-occupied homes in thousands of dollars (values range from 5 to 50).
The data set is available in scikit-learn; we split it into a training and a testing set. We train a nearest neighbor regressor on the training set and evaluate it on the testing set, where it achieves an error of 5.59316.
To use a conformal prediction method on the Boston housing data, we have to convert scikit-learn's data set to the Orange format. For the conversion we do not have to specify the domain (we use None) and let the library construct appropriate variables from the data and target arrays of scikit-learn's data set. To use the constructed nearest neighbors regressor, we simply reshape Orange's row instance to 2D and employ the previously constructed regressor from scikit-learn.
>>> import Orange

Now we are ready to set up a conformal predictor. As the nonconformity measure, we will use a non-normalized absolute error of 10 nearest neighbors determined by the Euclidean distance. To demonstrate another method, we will use this nonconformity in a cross conformal setting with 5 folds. Usually, we would employ a cross conformal instead of an inductive predictor in situations where the data set is too small for a fixed partition into separate training, calibration and testing sets. Predicting the test instance with stronger (smaller) significance levels results in wider prediction ranges. Note that, in this case, the ranges are centered on the same value that was predicted by scikit-learn's nearest neighbors regressor (22.74).
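The mechanics of cross conformal regression can be illustrated with a self-contained numpy sketch (a simplified stand-in using a from-scratch nearest-neighbors predictor and an absolute-error score; it mirrors the idea, not the library's CrossRegressor code):

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Mean response of the k nearest (1-D Euclidean) training neighbors."""
    idx = np.argsort(np.abs(train_X - x))[:k]
    return train_y[idx].mean()

def cross_conformal_interval(X, y, x_new, eps=0.1, folds=5, k=3):
    """Pool absolute-error NCSs over the folds, then widen the point prediction."""
    fold_ids = np.arange(len(X)) % folds
    scores = []
    for f in range(folds):
        tr, cal = fold_ids != f, fold_ids == f
        for xi, yi in zip(X[cal], y[cal]):
            scores.append(abs(knn_predict(X[tr], y[tr], xi, k) - yi))
    half_width = np.quantile(scores, 1 - eps)  # NCS rank matching significance eps
    pred = knn_predict(X, y, x_new, k)
    return pred - half_width, pred + half_width

X = np.arange(20.0)
y = 2 * X                       # noiseless linear response
lo, hi = cross_conformal_interval(X, y, x_new=10.0, eps=0.1)
print(lo <= 20.0 <= hi)         # the true response lies in the interval: True
```

Every instance serves for calibration exactly once, so the pooled score set is larger than what a single inductive split would provide.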

Nonconformity scores
In this section we use an artificial data set proposed in Friedman (1991) to illustrate the advantages of different nonconformity scores. The Friedman data set is available from the KEEL repository (http://keel.es/datasets.php, Alcalá-Fdez et al. 2011). It is defined by

y = 10 sin(π x 1 x 2 ) + 20 (x 3 − 0.5) 2 + 10 x 4 + 5 x 5 + ε,

with ε ∼ N (0, 1), where y is the dependent variable and x 1 , . . . , x 5 are the independent variables drawn from a uniform distribution over [0.0, 1.0] (Figure 4). N (0, 1) denotes random noise from a zero-mean normal distribution with variance 1.
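An equivalent sample can be regenerated with numpy from the Friedman definition (our own sketch; the KEEL copy may differ in random seed and sample order):

```python
import numpy as np

def make_friedman(n, seed=0):
    """Friedman benchmark: five uniform features, nonlinear response plus N(0,1) noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, 5))
    y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
         + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3]
         + 5 * x[:, 4]
         + rng.normal(0.0, 1.0, size=n))
    return x, y

x, y = make_friedman(1200)  # same size as the KEEL sample
```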
We experimented with six different nonconformity scores to demonstrate their differences. Refer to Section 2.3 for their descriptions. Some of them depend on underlying regressors, for which we used a linear regression (LR) and a k-nearest neighbors (KNN) model (with k = 20 in all such cases). We split the data set (1200 instances) into a training (907 instances) and a testing set (293 instances) in a 3:1 ratio. Because we used an inductive conformal regressor, we set aside one half (453 instances) of the training set for calibration. The output of conformal regressors for every instance is a range of predicted values. We observed the median width of predicted ranges (median range) in the test set and the interdecile range, which is the difference in widths of predicted ranges between the first and ninth decile.
To observe the interdecile range, we performed a single experiment, although the accuracy of all nonconformities is slightly below 0.9, where we would expect it to be. Several repetitions of the experiment should remedy that, but would invalidate the interdecile range measure. The AbsError method always outputs a range of the same width; therefore, its interdecile range is zero. AbsError based on the k-nearest neighbors model outperforms LR in terms of the median range. This is probably because the data set has a non-linear structure but is locally homogeneous, which benefits the KNN model. Other nonconformity scores adapt the width of their output, producing wider ranges for difficult or unusual instances while keeping the ranges narrow in the remaining (normal) cases. AbsErrorNormalized, AvgErrorKNN and ErrorModelNC all use the nearest neighbors in some way and improve on the median range of AbsError with LR. The best performing method regarding the median range is the new LOORegrNC nonconformity measure, which is based on the predictability of an instance from its neighborhood, in our case using an LR model. A possible interpretation is that the method discovers non-homogeneous regions and successfully adapts the width of the predicted range. However, note that this comes at the cost of a larger interdecile range compared to some other nonconformities.

Best practice guidelines
Here, we propose some guidelines for applying the conformal prediction framework and the presented Orange3-Conformal package.
We first have to choose a conformal prediction method among the transductive, inductive, and cross approaches for a given data set. The transductive prediction has the strongest theoretical foundation but is by far the most computationally demanding. In practice, we would use transduction on small data sets that include, for example, at most a few hundred instances. The results of transductive conformal predictions are similar to those of the other two approaches, unless the addition of a single labeled data instance can have a substantial impact on the trained model. On the other hand, the inductive method is the most practical due to its efficiency. It should be applied where the data set is sufficiently large to be split into a training and a calibration set. In between the transductive and inductive approaches lies the cross prediction method, which we can use when the data set is too big for transductive prediction and too small for a sufficiently representative split into the training and calibration data sets required by inductive prediction. The cross method reuses the training set, similarly to the well-known cross-validation approach in machine learning.
It is difficult to generalize about the success of various nonconformity scores; instead, given a specific data set, a range of nonconformity methods should be evaluated and scored. For this reason, Orange3-Conformal implements many of the nonconformity methods from the current literature. These methods are conceptually different, rely on various properties of the data and thereby complement each other. The NCSs based on the prediction error (AbsError) and model probabilities (InverseProbability) are the simplest and least computationally demanding, and therefore provide a good starting point. Many of the methods use properties of the nearest neighbors, such as responses, predicted values, and distances. Some NCS approaches build local leave-one-out models over a broader set of nearest neighbors, increasing the computational time. The methods based on error models require an additional model over the full data set that predicts the prediction error. While we suggest comparing all available nonconformity scores, the choice for inclusion may also depend on the available time and computational resources.
Our library aspires to provide broader access to conformal predictions and multiple NCSs within the Python modeling community, and its application might aid the future development of generalized guidelines for the selection of an NCS. The application of the outlined best practices is illustrated by the case study in the following section, which explores the NCSs for classification.

Case study: AMES mutagenicity
Assessment of the carcinogenic potential of new molecular structures is vital in several industries, including those from chemistry, cosmetics, and pharmaceutics. Within the pharmaceutical industry, the tolerance for compounds with carcinogenic potential is very low, and it is crucial to identify such compounds early within the drug discovery process to terminate their development and reduce costs. For a long period of time, the AMES mutagenicity test has been the first in vitro assay routinely used within the pharmaceutical industry to gauge carcinogenicity. For an extensive review of this method refer to Mortelmans and Zeiger (2000). The widespread use of the AMES test has resulted in its standardization, promoted by the Organization for Economic Co-operation and Development (OECD) and the International Commission on Harmonization. The standardization improves the uniformity of test procedures between laboratories and enables a comparison of data originating from different laboratories.

Preliminaries
The AMES assay uses several strains of Salmonella bacteria to detect a range of genetic damage caused by chemical substances. The Salmonella strains included in the assay have preexisting mutations that prevent these strains from producing the amino acid histidine required for the bacteria to grow and form colonies. Chemicals with a tendency to cause mutations might revert this inability to produce histidine, which would be indicated by a promoted growth and colony formation compared to the negative controls. Prudent treatment of mutagenicity requires a compound to be categorized as mutagenic if it promotes growth in any of the Salmonella strains.
Because of the routine use of the AMES test within several industries across an extended period, a substantial amount of AMES data has accumulated in the public domain. Also, most larger pharmaceutical companies have AMES mutagenicity data sets from multiple projects, supporting the inference of predictive models of the outcome of the AMES in vitro assay.
A computational model that uses only the information about the molecular structure has the potential to filter out the compounds with an increased risk for mutagenicity before they are considered to be synthesized. The cost of synthesis and testing of drug candidates is substantial, and a model predicting AMES mutagenicity could not only save time and money but also increase the quality of the pursued compounds. However, false negative predictions could result in investments in poor compounds, while an incorrect positive prediction might rule out the next blockbuster drug. Because of the potential significant harm of wrong predictions, methods assessing not only the average accuracy of the model but also the confidence in individual predictions are highly valuable.
In drug discovery, new chemical entities are continuously explored, resulting in the violation of the fundamental requirement of IID data. In the quantitative structure-activity relationship (QSAR) literature, such compounds are considered outside of the so-called applicability domain of the model, and for such compounds the validation statistics do not apply, nor do model-based confidence prediction methods. Consequently, in drug discovery, a model-based prediction confidence method should be complemented by an assessment of the IID assumption between the training set and the test set. Once the IID assumption has been assessed, application of the conformal prediction framework, reliant on the IID criterion, provides additional information about the extent of the "strangeness" of the example. Furthermore, the framework of conformal prediction not only assesses the prediction confidence but also allows the error of the model to be controlled, facilitating risk analysis and planning in drug discovery projects.

Hansen et al. (2009) have compiled an AMES mutagenicity data set from several public data sources, removing duplicates and compounds occurring in several data sets with contradictory classification. This data set contains the molecular structure of about 6500 compounds, together with their AMES classification as positive or negative. The data set is reasonably balanced, containing about 3500 positive and 3000 negative compounds.

Table 4: Accuracy scores of classifiers on the AMES data set.
Developing an empirical model that predicts a molecular property based solely on information about the chemical structure is referred to as QSAR modeling. As the chemical structure cannot be directly used for machine learning, it is instead represented numerically by a vector of so-called molecular descriptors. For an example of a conformal prediction application to QSAR modeling, see Eklund et al. (2012). In our experiment, a set of 177 well-established RDKit descriptors (Landrum 2006) was used as features to represent the molecular structure. This descriptor set contains the counts of various functional groups, molecular indices and calculated bulk properties such as lipophilicity and charge.

Experiments
We preprocessed the AMES data with QSAR features. We removed eleven features that were constant across all instances and one that contained spurious values. The resulting data set (available in the supplementary material) contains 6494 chemicals described by 165 features. We used an arbitrarily selected 80% (5220 instances) for the training set and the remaining 20% (1274 instances) for the test set.
To estimate the difficulty of the classification problem, we first evaluated several standard classification models (Table 4), using the default parameters of each method. For each classifier, we report its classification accuracy (CA), the area under the ROC curve (AUC), precision, recall, and the F1 score. The classifiers obtain comparable results except for Naive Bayes, which underperforms; it relies on the independence of features, which does not seem to hold in this problem. The obtained AUC values are also similar to those previously reported (Hansen et al. 2009); we attribute the slight differences observed to different molecular descriptors and classifier parameters. From here on, we use logistic regression as our classification model, also because of its computational efficiency.
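The evaluation metrics reported in Table 4 can be computed directly from a classifier's predictions. The following is a minimal, self-contained sketch of the standard binary-classification metrics; the function name and toy data are illustrative, not part of Orange3-Conformal:

```python
def binary_metrics(y_true, y_pred):
    """Classification accuracy, precision, recall, and F1 score
    for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    ca = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return ca, precision, recall, f1

# Toy predictions for five compounds (1 = mutagenic, 0 = non-mutagenic).
ca, precision, recall, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```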
We will use the inductive conformal prediction approach, which requires a separate calibration set. For this purpose, we further split the training set in half into an actual training set (2610 instances) and a calibration set (2610 instances).
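The mechanics of the inductive approach can be sketched in a few lines: nonconformity scores are computed once on the calibration set, and each candidate label of a test instance is included in the prediction set if its p-value exceeds the significance level ε. The sketch below uses the inverse-probability nonconformity on precomputed class probabilities; the function names are ours for illustration, not the library's API:

```python
import numpy as np

def calibration_scores(cal_probs, cal_labels):
    """Inverse-probability nonconformity: 1 minus the probability
    the underlying model assigns to the true class."""
    return 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

def predict_set(test_probs, cal_scores, eps):
    """Return the set of class labels whose p-value exceeds eps."""
    region = []
    n = len(cal_scores)
    for label, p in enumerate(test_probs):
        score = 1.0 - p  # nonconformity of this candidate label
        p_value = (np.sum(cal_scores >= score) + 1) / (n + 1)
        if p_value > eps:
            region.append(label)
    return region

# Toy example: two calibration instances and one test instance.
cal_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
cal_labels = np.array([0, 1])
scores = calibration_scores(cal_probs, cal_labels)  # [0.1, 0.2]
```

With only two calibration instances every p-value is at least 1/3, so at ε = 0.05 both labels remain in the prediction set; larger calibration sets, like the one used here, make the regions sharper.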
To confirm that the conformal predictor performs as expected, we construct a validation plot (Figure 5). We measure the predictor's accuracy at several significance levels and check whether the observed error rate equals the specified significance level. Ideally, we would get a straight line with a slope of −1. For this purpose, we used an InductiveClassifier based on the InverseProbability nonconformity score of the underlying logistic regression model.
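The validity property behind such a plot can be simulated directly: for IID nonconformity scores, the fraction of test instances whose p-value falls at or below ε should be approximately ε. A small sketch with synthetic uniform scores (illustrative only, not using the library):

```python
import numpy as np

def error_rate(cal_scores, test_scores, eps):
    """Fraction of test instances whose p-value is <= eps, i.e. the
    empirical error rate at significance level eps."""
    counts = np.sum(cal_scores[None, :] >= test_scores[:, None], axis=1)
    p_vals = (counts + 1) / (len(cal_scores) + 1)
    return float(np.mean(p_vals <= eps))

rng = np.random.default_rng(0)
# IID scores, so the conformal guarantee should hold on average.
cal, test = rng.uniform(size=1000), rng.uniform(size=1000)
for eps in (0.05, 0.1, 0.2):
    print(eps, round(error_rate(cal, test, eps), 3))
```

Plotting accuracy (one minus the observed error rate) against ε then yields the expected line with slope −1, up to sampling noise.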
Besides validity, we are mostly interested in the efficiency of conformal predictors (Vovk et al. 2016). In classification problems, we can measure this as the number of singleton predictions (prediction sets of size one), the ratio of correct singleton predictions, and the number of multiple and empty predictions (Table 5). We observe the same results for the inverse probability and probability margin nonconformities; these two nonconformities are equivalent in binary classification problems. The measured accuracy corresponds well to the specified significance level (error rate) ε. As we increase ε, the number of singleton predictions grows, while the number of predictions containing multiple classes declines. There are almost no cases with an empty set of predicted classes, which would be assigned to instances too unusual for every possible class under the given significance level. The purpose of nonconformity measures is to distinguish between easy and difficult or unusual instances. The results for the different nonconformity scores are quite similar in this case, with SVMDistance standing out slightly.
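The efficiency statistics of Table 5 can be derived from the prediction sets alone. A small helper, with hypothetical names chosen for illustration, might look like:

```python
def efficiency(pred_sets, true_labels):
    """Summarize conformal prediction sets: counts of singleton,
    multiple-class, and empty predictions, plus singleton accuracy."""
    singles = [(s, y) for s, y in zip(pred_sets, true_labels) if len(s) == 1]
    stats = {
        "singleton": len(singles),
        "multiple": sum(len(s) > 1 for s in pred_sets),
        "empty": sum(len(s) == 0 for s in pred_sets),
    }
    stats["singleton_accuracy"] = (
        sum(s[0] == y for s, y in singles) / len(singles) if singles else None
    )
    return stats

# Toy prediction sets for five instances and their true labels.
sets = [[0], [0, 1], [], [1], [1]]
truth = [0, 1, 0, 1, 0]
stats = efficiency(sets, truth)
```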
Because of the importance of correct predictions in drug discovery projects, high confidence in the predictions, at the expense of uninformative predictions, is often preferable. Selecting the 0.05 significance level and the KNNFraction method would result in 54.8% of the predictions being uninformative. However, the accuracy of the singleton predictions would be 89.1%, compared to the classification accuracy of the logistic regression model, which is 74.8%. Many drug discovery projects consider an accuracy level of 90% as sufficient and could discard about 55% of the compounds from expensive experimental testing. Table 6 presents the conformal classifier's average confidence and credibility, which are measures independent of the specified significance level. The confidence of a single prediction is equal to 1 − ε1, where ε1 represents the minimum significance level that would still result in a prediction of a single label. Hence, for the KNNFraction nonconformity over the AMES test data set, we can exclude the least probable class with an average certainty of 91%. Similarly, the credibility of a prediction is the minimum ε that would result in an empty prediction set. Small credibility indicates an unusual example. For a complete introduction to confidence and credibility in conformal prediction, refer to Saunders et al. (1999).
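Given the p-values of all candidate classes for a test instance, confidence and credibility follow directly: confidence is one minus the second-largest p-value (the certainty with which all but the most plausible class can be excluded), and credibility is the largest p-value. A toy sketch with illustrative values, not taken from the AMES experiment:

```python
def confidence_credibility(p_values):
    """Confidence = 1 - second-largest p-value; credibility = largest
    p-value. Small credibility flags an unusual example."""
    ps = sorted(p_values, reverse=True)
    return 1.0 - ps[1], ps[0]

# Hypothetical p-values for the three candidate classes of one instance.
conf, cred = confidence_credibility([0.65, 0.08, 0.02])
```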

Conclusion
We have presented a Python-based implementation of several conformal prediction methods. Conformal prediction is a formal statistical approach that addresses the reliability problem in classification and regression and has already been used in various applications in the natural sciences. Compared to other software packages, Orange3-Conformal covers a wider range of nonconformity scores and is designed in a modular way to foster experimentation with new approaches and scoring techniques. We have purposely built the library within the Orange data mining framework, opening the possibility of extending its functionality with a graphical user interface and integrating it with Orange's interactive data analytics framework.