Binary classification of imbalanced datasets using conformal prediction

https://doi.org/10.1016/j.jmgm.2017.01.008

Highlights

  • Conformal Prediction finds a majority of active compounds in highly imbalanced data.

  • Separate distributions are used for the two classes during classification.

  • No balancing measures are needed for imbalanced data when using Conformal Prediction.

Abstract

Aggregated Conformal Prediction is used as an effective alternative to other, more complicated and/or ambiguous methods involving various balancing measures when modelling severely imbalanced datasets. Additional explicit balancing measures, beyond those already part of the Conformal Prediction framework, are shown not to be required. The Aggregated Conformal Prediction procedure appears to be a promising approach for severely imbalanced datasets, retrieving a large majority of the active minority class compounds while avoiding information loss or distortion.

Introduction

Quantitative structure-activity relationship (QSAR) analysis has been used for many years in a variety of application areas. The basic function of QSAR is to quantitatively couple a chemical structure description to a given behavior, such as a physical or pharmacological property. Considerable progress has been made during the last 10 years with respect to better, more informative structural descriptions as well as improved statistical techniques for relating chemical structure descriptions to a given property [1], [2]. The resulting QSAR models have considerable potential for reducing experimental costs, either by refinement or replacement of experiments. QSAR models are also increasingly interpretable, which aids the general understanding of the physical and chemical properties underlying a particular phenomenon. Two different types of QSAR models have emerged over the past decade with respect to interpretability and transparency. The first type is closer to “classical” QSAR, where the model is less complex and interpretable in terms of the method and/or descriptors used for the modelling. The second type focuses on larger and chemically diverse datasets using statistical methods from the machine learning domain, resulting in more opaque and complex models. In a recent paper, Fujita and Winkler discussed these two schools of QSAR practice, pointing out the utility and advantages of both approaches and how they can work in concert to provide better and more useful QSAR models in the future [3].

For many important endpoints the datasets investigated are imbalanced with respect to the distribution between active and inactive compounds. This frequently represents a challenge when modelling these datasets accurately, because the minority class is often the active class and thus of greater interest. A sampling of just over 1700 PubChem datasets with more than 50 actives and 1000 compounds screened in total suggests that close to 70% have a class distribution (active:inactive) that is over 1:10 and 75% have a class distribution that is over 1:5. For datasets more representative of screening purposes, with more than 5000 tested compounds, these proportions increase to 94% and 97%, respectively (Fig. 1).
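The kind of ratio calculation behind these percentages is straightforward; the following is a minimal sketch, assuming per-assay active/inactive counts have already been extracted from PubChem (the counts shown are placeholders, not data from Fig. 1).

```python
# Sketch of the class-distribution survey; each tuple is (n_active, n_inactive)
# for one PubChem assay. The values below are placeholders for illustration.
assay_counts = [(120, 4500), (75, 900), (300, 12000)]

def inactive_to_active_ratio(n_active, n_inactive):
    """Return the inactive:active ratio, e.g. 10.0 corresponds to a 1:10 split."""
    return n_inactive / n_active

# Apply the same eligibility filter as in the text: >50 actives, >1000 compounds.
eligible = [(a, i) for a, i in assay_counts if a > 50 and a + i > 1000]
over_1_to_10 = sum(inactive_to_active_ratio(a, i) > 10 for a, i in eligible) / len(eligible)
over_1_to_5 = sum(inactive_to_active_ratio(a, i) > 5 for a, i in eligible) / len(eligible)
print(f"> 1:10 imbalance: {over_1_to_10:.0%}, > 1:5 imbalance: {over_1_to_5:.0%}")
```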

These data suggest that as chemical libraries increase in size, and as our capacity to generate data from these libraries increases, so will our need to adapt QSAR methods to the distributions of these emerging datasets, in order to learn from them effectively and to build useful predictive models that do not over-represent one class over the other. This is particularly important for toxicologically relevant datasets, in which statistical anomalies can give rise to false negative predictions, which pose a significant problem in the form of potentially erroneous risk assessments. Thus, our goal in this study is to capture the information in the entire dataset while ensuring that the risk of making prediction errors is both minimized and known.

Many methods have been proposed over the years to address the problem of imbalanced datasets, in order to achieve acceptable quality with respect to both false negatives and false positives for the minority and majority class, respectively. Some of these approaches involve over- and/or undersampling of the classes, e.g. SMOTE [4], [5], [6], with or without weighting of the classes [7]. Various loss or cost functions that penalize errors [8], [9], [10] have also been used to handle class imbalances, as have techniques such as boosting and bagging [11], [12]. Ensemble learning using resampled, sometimes more balanced, training sets has also been proposed to obtain better model performance [13]. Other approaches include selecting descriptors separately for the minority and majority classes [9] or using backpropagation [14]. On a more algorithmic level, class balancing has been attempted using particle swarm optimization for subset selection of training sets [15]. An interesting, somewhat different, approach for handling class imbalances has recently been proposed by Kondratovich et al. [16], using transductive modelling, i.e. transductive support vector machines, as an alternative to the more commonly used inductive techniques described above. Recently, Clark and co-workers proposed the use of beta-binomial distributions for improving model quality, including for imbalanced datasets [17].
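As an illustration of the explicit balancing measures listed above (and not part of the present work), the sketch below shows two of the most common options, SMOTE oversampling and class weighting, using scikit-learn and the imbalanced-learn package; the descriptor matrix and activity labels are placeholders.

```python
# Illustrative sketch of two common balancing measures: SMOTE oversampling and
# class weighting. Requires scikit-learn and imbalanced-learn; X and y below are
# placeholder descriptors and 0/1 activity labels, not data from this study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))              # placeholder descriptor matrix
y = (rng.random(1000) < 0.05).astype(int)    # roughly 1:20 active:inactive

# Option 1: synthesize additional minority-class samples before fitting.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = RandomForestClassifier(random_state=0).fit(X_res, y_res)

# Option 2: keep the data as-is and re-weight errors on the minority class.
clf_weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)
```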

In this work we would like to introduce Conformal Prediction [18] as a way of handling class imbalances without the need for more explicit balancing measures, e.g. over- and/or undersampling, subsetting or threshold modifications that may introduce complications such as information loss.
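To make the class-conditional (Mondrian) idea behind this approach concrete, the following is a minimal sketch of an inductive conformal predictor for binary classification, written in Python with scikit-learn. The random forest model, the 70/30 calibration split and the probability-based nonconformity score are illustrative assumptions, not the exact aggregated procedure used in this study.

```python
# Minimal sketch of a Mondrian (class-conditional) inductive conformal predictor.
# Illustrative only: the underlying model, split size and nonconformity score are
# assumptions, not the aggregated setup used in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def mondrian_icp(X_train, y_train, X_test, significance=0.2, seed=0):
    # Split the training data into a proper training set and a calibration set.
    X_prop, X_cal, y_prop, y_cal = train_test_split(
        X_train, y_train, test_size=0.3, stratify=y_train, random_state=seed)
    y_cal = np.asarray(y_cal)
    model = RandomForestClassifier(random_state=seed).fit(X_prop, y_prop)

    # Nonconformity score: 1 - predicted probability of the putative class.
    cal_probs = model.predict_proba(X_cal)
    test_probs = model.predict_proba(X_test)

    prediction_sets = []
    for probs in test_probs:
        region = set()
        for label in (0, 1):
            # Mondrian step: each class is calibrated only against calibration
            # examples of that same class, so the minority class gets its own
            # error guarantee instead of being swamped by the majority class.
            cal_scores = 1.0 - cal_probs[y_cal == label, label]
            score = 1.0 - probs[label]
            p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
            if p_value > significance:
                region.add(label)
        prediction_sets.append(region)
    return prediction_sets  # each element is set(), {0}, {1} or {0, 1}
```

Because calibration is performed per class, the expected error rate at a given significance level holds for each class separately, which is what removes the need for the explicit balancing measures listed above.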

Section snippets

Datasets and structure standardization

  • a

    Hansen: The Ames mutagenicity dataset was taken from Hansen et al. [19]. Two studies were performed, using 10% and 20% randomly selected training sets, respectively.

  • b

    aid1851: The high-throughput screening data on CYP2D6 inhibition was obtained from PubChem AID 1851 [National Institutes of Health: AID 1851 – PubChem BioAssay Summary. PubChem Bioassay 1851, http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1851; accessed July 9th 2015]. The training sets were randomly selected as 10% of the dataset.

  • c

    aid2796:

Assessment of the performance of active predictions

Inspection of Fig. 3a–h reveals a good balance within the predicted single classes, active and inactive, where a clear majority of both classes is correctly predicted, i.e. there are relatively few active compounds predicted as inactive and vice versa. It is also interesting to note how well the active class compounds are predicted regardless of the class distribution of the dataset in question, even for the extremely imbalanced aid493091 (Fig. 3h). The high retrieval rate is also manifested by the
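The per-class breakdown summarised in Fig. 3 can be derived directly from the prediction sets. The helper below is a hypothetical sketch (not code from the paper) that tallies, for each true class, the four possible conformal outcomes and the resulting retrieval of actives; it assumes prediction sets in the form returned by the Mondrian sketch shown earlier.

```python
# Hypothetical helper: tally conformal outcomes per true class and report the
# fraction of true actives whose prediction set contains the active label.
from collections import Counter

def summarize_outcomes(prediction_sets, y_true, active_label=1):
    per_class = {0: Counter(), 1: Counter()}
    for region, truth in zip(prediction_sets, y_true):
        if region == {active_label}:
            key = "single active"
        elif region == {1 - active_label}:
            key = "single inactive"
        elif len(region) == 2:
            key = "both"
        else:
            key = "empty"
        per_class[truth][key] += 1

    actives = per_class[active_label]
    retrieval = (actives["single active"] + actives["both"]) / max(sum(actives.values()), 1)
    return per_class, retrieval
```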

Conclusions

Aggregated Conformal Prediction seems to be an effective alternative to other, more complicated and/or ambiguous methods involving various balancing measures when modelling severely imbalanced datasets, without the need for additional explicit balancing measures beyond those already part of the Conformal Prediction framework. The Mondrian approach, i.e. using different distributions for the two binary classes, seems to be advantageous for correctly identifying and retrieving compounds

Acknowledgments

The research at Swetox was supported by Stockholm County Council, Knut & Alice Wallenberg Foundation, and Swedish Research Council FORMAS.

References (29)

  • S. Kotsiantis et al., Handling imbalanced datasets: a review, GESTS Int. Trans. Comput. Sci. Eng. (2006)

  • J.M. Garcia-Gomez et al., Definition of loss functions for learning from imbalanced data to minimize evaluation metrics

  • H. Parvin et al., A new imbalanced learning and dictions tree method for breast cancer diagnosis, J. Bionanosci. (2013)

  • H. Wang et al., Large unbalanced credit scoring using lasso-logistic regression ensemble, PLoS One (2015)