The Difference Between the Accuracy of Real and the Corresponding Random Model is a Useful Parameter for Validation of Two-State Classification Model Quality

The simplest and most commonly used measure for assessing classification model quality is the parameter Q2 = 100 (p + n) / N (%), named the classification accuracy, where p, n and N are the total numbers of correctly predicted compounds in the first and in the second class, and the total number of elements of classes (compounds) in the data set, respectively. Moreover, the most probable accuracy that can be obtained by a random model is calculated for a two-state model by the formula Q2,rnd = 100 [(p + u) (p + o) + (n + u) (n + o)] / N² (%), where u and o are the total numbers of under-predictions (when class 1 is predicted by the model as class 2) and over-predictions (when class 2 is predicted by the model as class 1) in the data set, respectively. Finally, the difference between these two parameters, ΔQ2 = Q2 – Q2,rnd, is introduced, and it is suggested to compute and report ΔQ2 for each two-state classification model to assess its contribution over the accuracy of the corresponding random model. When the data set is ideally balanced, having the same numbers of elements in both classes, the two-state classification problem is the most difficult, with maximal Q2 = 100 % and Q2,rnd = 50 %, giving the maximal ΔQ2 = 50 %. The usefulness of the ΔQ2 parameter is illustrated in a comparative analysis of two-class classification models from the literature for prediction of secondary structure of membrane proteins and of several quantitative structure-property models. The real contributions of these models over the random level of accuracy are calculated, and their ΔQ2 values are compared with each other and with the value of ΔQ2 (= 50 %) for the most difficult two-state classification model.


INTRODUCTION
In two-state classification modeling one wants to develop a model for a selected molecular property or activity (y-variable) using one or more input molecular attributes (descriptors, i.e. x-variables), which, for a molecule and to a certain accuracy, correctly estimates or predicts its property or activity class.
In estimating the quality of two-state models the parameter Q2 can be used, which is named the classification accuracy (in %), [1] or the percentage of all correct predictions. [2] The parameter Q2 is the percentage of correctly classified elements of the first (p) and of the second class (n) in a set of N elements, each belonging to one of two classes. If one reports a Q2 value of 90 % (or 95 %) for a two-state classification model, the model seems impressively accurate. However, the real level of model accuracy can be estimated only if that Q2 value is compared with the accuracy that can be obtained by a random model (Q2,rnd). It is evident that, in the above-mentioned case (i.e. Q2 = 90 %), the real model contribution is significantly different if the most probable random accuracy is Q2,rnd = 50 % than if it is Q2,rnd = 70 %. For each model, including structure-property models related to small molecules or proteins, it is possible to calculate (or to estimate by simulations) the level of accuracy that can be obtained by a random model which uses randomized original data (variables) or purely random data (variables). When the real accuracy of a model (estimated by a statistical parameter) is reported, another important value that has to be given is the value of the same statistical parameter for the corresponding random model, which provides information on the level of chance (random) accuracy. Two classical works that addressed this topic in the analysis of correlations between variables are those by Topliss et al., published forty years ago, for multivariate linear regression models. [3,4] It is obvious that some level of random correlation is present between each pair of variables, and this is also demonstrated in those papers on randomly generated variables. [3,4]
Additionally, the analysis of several real models was performed, and one recommendation (later often used in chemical structure-property modeling) was given about the maximal acceptable number of variables in Multivariate Linear Regression (MLR) models. Namely, the authors estimated that the number of variables/descriptors involved in MLR models should not exceed 1/5 of the number of cases (molecules) in the data set. [4] Random correlation (or accuracy) is higher for real than for random pairs of variables, because real variables have (typically) a more monotonous distribution of values than random ones. In addition, real variables have, as a rule, a real common background relation to some basic properties of the constituent elements of the data set. In the case of data sets of chemical compounds or proteins used in modeling of activities, properties, or structural characteristics of proteins (like secondary structure or topology of membrane proteins), molecular descriptors derived from chemical structure are commonly related to basic properties of compounds (e.g. molecular weight, size, shape, the number of specific atoms, the number of bonds) or proteins (e.g. sequence length, the total number of some specific amino acid types, percentage of a secondary structure). Thus, to assess the real level of random accuracy (or correlation) of a model, one must ensure that the generated random data used in simulations have a structure and distribution similar to those of the real input data.
We present here the analysis and estimate of random accuracy for two-state classification problems, and compare real and random accuracies on several data sets related to (1) the modeling and prediction of secondary structure of membrane proteins, based on their primary structure, and (2) two-class properties of small molecules from the field of Quantitative Structure-Activity Relationships (QSAR). On examples of real data sets we will analyse the influence of the balance of the numbers of elements in the two classes (in experimental input data and in estimated / predicted data) on the random accuracy expressed by the Q2 parameter.

Definition of Secondary Structure of Membrane Proteins
In the simplest classification problem only two classes of experimental properties or activities are defined. The secondary structure of membrane proteins is mostly determined by the parts of the sequence interacting with the membrane, which are in the alpha (forming an alpha-helix) or beta (forming a beta-barrel) secondary structure, while the rest of the sequence is usually considered to be in an irregular secondary structure. In this study, we will validate two-state classification models on data sets of alpha-type (i.e. helical) integral membrane proteins, the largest class of membrane proteins. Namely, it is assumed that 20-30 % of sequenced genomes code for helical membrane proteins, but solved structures of helical or beta membrane proteins make up less than 1.2 % (~ 1370 proteins) of the ~ 120000 known (experimentally solved) protein structures deposited in the Protein Data Bank. [5] For the secondary structure of alpha-type membrane proteins it is common to define two classes of secondary structure of amino acids in the protein sequence: (1) alpha secondary structure, containing one or more transmembrane segment(s), each consisting of (mostly) 19-21 neighbouring amino acids that form an integral membrane alpha helix, denoted by M, and (2) extra-membrane parts having a secondary structure that is named undefined, denoted by U. A simplified scheme of the experimental secondary structure of a membrane protein having 100 amino acid residues (each amino acid in the primary structure is designated by '-') and one transmembrane segment of 20 amino acids in the primary structure is given in Scheme 1.

Contingency Table Definition
Comparing real (experimental) and predicted structures from Scheme 1 we can define the following parameters:
• p = positive correct prediction (real M predicted as M) = 15 (underlined both in real and predicted structures)
• u = underprediction (real M predicted as U) = 5
• n = negative correct prediction (real U predicted as U) = 75
• o = overprediction (real U predicted as M) = 5.
It is evident that p + u + n + o = N = 100 amino acids, and that p + u = n(M) = 20, and n + o = n(U) = 80. In this case we say that the prediction done by the model is balanced, because the model predicts the same numbers of M and U states (classes) as it is in experimental sequence. The prediction quality of two-class model can be also described by 2 × 2 contingency table given in Table 1.
An ideal model would be the one with u = 0 and o = 0 (N = p + n), when all elements in both classes are correctly predicted.
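The counting of p, u, n and o can be sketched in a few lines of Python. This is only an illustration: the exact positions of the M segment in the two strings are an assumption chosen to reproduce the totals quoted for Scheme 1, and the function name `contingency_counts` is hypothetical.

```python
def contingency_counts(real, predicted, positive="M"):
    """Return (p, u, n, o) for two equally long strings of class labels."""
    if len(real) != len(predicted):
        raise ValueError("sequences must have the same length")
    p = u = n = o = 0
    for r, q in zip(real, predicted):
        if r == positive and q == positive:
            p += 1          # real M predicted as M
        elif r == positive:
            u += 1          # real M predicted as U (underprediction)
        elif q == positive:
            o += 1          # real U predicted as M (overprediction)
        else:
            n += 1          # real U predicted as U
    return p, u, n, o


# Assumed layout: one 20-residue M segment, predicted 5 residues too late.
real = "U" * 40 + "M" * 20 + "U" * 40
predicted = "U" * 45 + "M" * 20 + "U" * 35

print(contingency_counts(real, predicted))  # (15, 5, 75, 5)
```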

Balanced Data Sets and Balanced Models
Real two-class data sets usually have different numbers of elements in both classes. However, in some cases it is possible to create an ideally balanced experimental data set with the same number of elements of classes (p + u = n + o). If possible, it is desirable to use balanced experimental data set in model development and optimization, because in that case both classes are equivalently treated during the model training, and one expects that characteristics of both classes will be evenly memorized by the model, i.e. evenly represented by the model's parameters. Another case is balanced set in estimation (or prediction), i.e. when the same numbers of classes are estimated by the model (p + o = n + u), and, in that case, it is not necessary that a totally balanced experimental set was used for model training.
A model balanced in estimation or prediction is the third concept, introduced and used for models that conserve in estimation the total numbers of classes from the experimental set used for model training. In that case we have both p + u = p + o and n + o = n + u, which gives u = o. However, u and o do not need to be equal to zero, and it can be that p + u ≠ n + o (for the experimental set) or p + o ≠ n + u (for the estimation of classes). Thus, a well performed modeling will normally end after the model achieves the balance between u and o in estimation on the training (or validation) set, and only in the case when u = o is it possible to reach (in an ideal case) the maximal possible classification accuracy Q2 of 100 %.
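The three notions of balance differ only in which marginal totals of the contingency table are compared. A minimal sketch follows, with hypothetical function names and the counts quoted for Scheme 1 used as the example:

```python
def is_balanced_data(p, u, n, o):
    """Experimental set has equally populated classes: p + u == n + o."""
    return p + u == n + o


def is_balanced_estimation(p, u, n, o):
    """Model predicts both classes equally often: p + o == n + u."""
    return p + o == n + u


def is_balanced_model(p, u, n, o):
    """Model conserves the experimental class totals, which reduces to u == o."""
    return u == o


scheme1 = (15, 5, 75, 5)  # p, u, n, o
print(is_balanced_data(*scheme1),        # False: 20 M versus 80 U
      is_balanced_estimation(*scheme1),  # False: 20 predicted M versus 80 predicted U
      is_balanced_model(*scheme1))       # True: u == o == 5
```

The example shows that a model can be balanced (u = o) even when neither the experimental set nor the set of predictions is balanced.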

Real and Random Accuracies of a Model
Starting from the contingency table, different statistical parameters have been defined and used in the scientific literature for estimating model accuracy. [2,6] Parameter Q2 [Eq. (1)], related to the classification accuracy of a real model, is the simplest one that can be calculated from the contingency table:

$$Q_2 = 100\,\frac{p + n}{N}\ (\%) \tag{1}$$

Scheme 1. Simplified view of experimental membrane protein secondary structure and secondary structure estimated or predicted by a balanced model. The total numbers of amino acids predicted by the model to be in states U and M are 80 and 20, respectively.

If a random model predicts (p + o) amino acids to be in class M for (p + u) experimentally determined amino acids in class M, and (n + u) = N − (p + o) amino acids to be in class U for (n + o) experimentally determined amino acids in class U, then the random accuracy Q2,rnd can be estimated as:

$$Q_{2,\mathrm{rnd}} = 100\left[\frac{p + u}{N}\cdot\frac{p + o}{N} + \frac{n + o}{N}\cdot\frac{n + u}{N}\right]\ (\%) \tag{2}$$

or shortly

$$Q_{2,\mathrm{rnd}} = 100\,\frac{(p + u)(p + o) + (n + o)(n + u)}{N^2}\ (\%) \tag{3}$$

Also, this is the most probable value of the Q2 parameter that can be obtained by any random model. If the experimental secondary structure of a protein (or a data set of proteins as a whole) contains the same number of M and U states/classes, then p + u = n + o = N / 2 and we can factor out (p + u) in the numerator of [Eq. (3)]:

$$Q_{2,\mathrm{rnd}} = 100\,\frac{(p + u)\left[(p + o) + (n + u)\right]}{N^2} = 100\,\frac{(N/2)\,N}{N^2} = 50\ \% \tag{4}$$

This means that the random value of the Q2 parameter is Q2,rnd = 50 % for data with an ideal balance of the numbers of M and U states in the experimental structure, regardless of how large or small the ratio between the numbers of M and U classes estimated (or predicted) by a model is. Note that this also holds (Q2,rnd = 50 %) for ideally balanced estimation (or prediction) by the model, i.e. when p + o = n + u = N/2, regardless of how large or small the ratio of the numbers of M and U classes in the experimental structure is. If one obtains a balanced model which estimates (or predicts) in the secondary structure of proteins the same numbers of states/classes M and U as in the experimental structure (p + u = p + o and n + o = n + u), then Q2,rnd from [Eq. (3)] becomes:

$$Q_{2,\mathrm{rnd}} = 100\,\frac{(p + u)^2 + (n + o)^2}{N^2}\ (\%) \tag{5}$$
Equation (5) enables us to estimate the most probable random accuracy for a balanced model that one plans to develop; in that case Q2,rnd can be calculated using only the experimental numbers of classes, i.e. here p + u for state M (class 1) and n + o for state U (class 2). It follows from [Eq. (5)] that for a balanced model the minimal value of Q2,rnd is 50 %, reached when both classes are equally represented in the experimental data set (p + u = n + o).
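The accuracy formulas translate directly into code. The sketch below assumes the general form Q2,rnd = 100 [(p + u)(p + o) + (n + o)(n + u)] / N² and its balanced-model special case built from the squared class totals; the function names are hypothetical and the example counts are invented for illustration.

```python
def q2(p, u, n, o):
    """Classification accuracy Q2 in percent."""
    return 100.0 * (p + n) / (p + u + n + o)


def q2_rnd(p, u, n, o):
    """Most probable accuracy of a random model (general two-state form), in percent."""
    total = p + u + n + o
    return 100.0 * ((p + u) * (p + o) + (n + o) * (n + u)) / total ** 2


def q2_rnd_bal(n_class1, n_class2):
    """Random accuracy of a balanced model, from experimental class totals only."""
    total = n_class1 + n_class2
    return 100.0 * (n_class1 ** 2 + n_class2 ** 2) / total ** 2


# Balanced experimental set (50/50) but strongly unbalanced predictions (70/30):
print(q2_rnd(40, 10, 20, 30))   # 50.0
# Balanced model on an imbalanced set with 30 and 70 elements:
print(q2_rnd_bal(30, 70))       # 58.0
```

The first call illustrates that a balanced experimental set fixes Q2,rnd at 50 % whatever the ratio of predicted classes; the second shows how class imbalance alone raises the random baseline.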

The Difference Between Real and Random Accuracies of a Model
Finally, the difference (in %) between the real model accuracy Q2 and the corresponding random accuracy Q2,rnd is calculated by [Eq. (6)]:

$$\Delta Q_2 = Q_2 - Q_{2,\mathrm{rnd}} \tag{6}$$

This value has its maximum of (ΔQ2)max = 50 % for balanced model estimation or prediction (u = o) when: a) Q2 takes its maximal value of 100 %, and b) the experimental data set is balanced, having the same numbers of both classes (M and U for proteins, or class 1 and class 2 for a general two-state QSAR model), so that Q2,rnd = 50 %. At the same time, the balanced model developed on such an experimental data set is the most difficult problem for modeling (and is analogous to the coin-tossing problem repeated N times, where N = p + n + u + o). Thus, the maximal range (ΔQ2) for development and optimization of a model, from the random level (where the 'algorithm' is merely guessing) to the maximal level, is 50 %. Any real two-state classification model developed on an imbalanced experimental (training) data set, with different total numbers of elements in the two classes, will have a difference ΔQ2 between the real and random accuracies smaller than 50 %.
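The coin-tossing analogy can be illustrated by simulation. The sketch below is not the procedure used in this work: it takes, as an assumption, a "random model" to be one that keeps the predicted class totals but assigns them to positions at random (a permutation of the predictions). The data set is an invented, ideally balanced one with an ideal model, and all names are hypothetical.

```python
import random


def simulate_random_q2(real, predicted, trials=2000, seed=0):
    """Mean accuracy (%) when predicted labels are randomly reassigned to positions."""
    rng = random.Random(seed)
    shuffled = list(predicted)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)  # keeps the predicted class totals unchanged
        hits = sum(r == q for r, q in zip(real, shuffled))
        total += 100.0 * hits / len(real)
    return total / trials


real = "M" * 50 + "U" * 50   # ideally balanced experimental set
predicted = real              # ideal model, so Q2 = 100 %

q2_real = 100.0 * sum(r == q for r, q in zip(real, predicted)) / len(real)
q2_random = simulate_random_q2(real, predicted)
delta_q2 = q2_real - q2_random

print(round(q2_random, 1), round(delta_q2, 1))
```

Under this permutation assumption the simulated random accuracy settles close to 50 %, so the simulated ΔQ2 approaches the 50 % ceiling discussed above.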

RESULTS
For a sequence like the one from Scheme 1 (N = 100), with the 2 × 2 contingency table given in Table 1, Eqs. (1), (3) and (6) give Q2 = 90 %, Q2,rnd = 68 % and ΔQ2 = 22 %. Knowing that the most difficult two-state classification problem, having equal numbers of both classes in the experimental set, has Q2,rnd = 50 %, and that an ideally balanced model has the maximal classification accuracy Q2 of 100 %, the maximal possible contribution ΔQ2 of such a model is 50 % (Eq. (6)). Based on this, one can see that the maximal possible contribution of the model from Scheme 1 is significantly lower than that for the most difficult two-state problem, for which ΔQ2 = 50 %. The real model contribution extends from the random level of 68 % (which is primarily defined by the class imbalance in the experimental set used for model development, because class M contains 20 % and class U 80 % of the elements) to the classification accuracy of Q2 = 90 % obtained by the model estimation.
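As a check of the arithmetic, the Scheme 1 counts (p = 15, u = 5, n = 75, o = 5, N = 100) can be substituted into the definitions of Q2, Q2,rnd and ΔQ2. The last line is an added illustration that assumes an ideal model (Q2 = 100 %) on the same data set.

```latex
\begin{align*}
Q_2 &= 100\,\frac{p + n}{N} = 100\,\frac{15 + 75}{100} = 90\ \% \\
Q_{2,\mathrm{rnd}} &= 100\,\frac{(p + u)(p + o) + (n + o)(n + u)}{N^2}
                   = 100\,\frac{20 \cdot 20 + 80 \cdot 80}{100^2} = 68\ \% \\
\Delta Q_2 &= Q_2 - Q_{2,\mathrm{rnd}} = 90\ \% - 68\ \% = 22\ \% \\
(\Delta Q_2)_{\text{ideal model}} &= 100\ \% - 68\ \% = 32\ \%
\end{align*}
```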

Analysis of Real and Random Accuracies of Models for Prediction of Membrane Proteins' Secondary Structure
We analysed random accuracies in several sets of membrane proteins from the literature. When it was not possible to find p, n, o and u values for model predictions in the literature (or to calculate them from data given in the published paper), but information on the experimentally determined secondary structure was available, e.g. p + u and N (the total number of amino acids), we calculated Q2,rnd by [Eq. (5)] assuming an ideal case, i.e. that the model is balanced (p + u = p + o and, consequently, n + o = n + u). In such cases of balanced estimation/prediction by a model, Q2,rnd is denoted as Q2,rnd-bal.
From Table 2 one can see that the most probable random classification accuracy for the selected real data sets varies from 54 % to (even) 64 %, indicating a remarkable imbalance between the numbers of elements/states M and U in the experimental set. Values of Q2,rnd-bal (= Q2,rnd) from Table 2 can be reduced to some extent by balancing the data set of membrane proteins. Because the numbers of positive (p) and negative (n) correct predictions are not separately reported in the analysed manuscripts, [7-9] it was not possible to calculate either the Q2 or the ΔQ2 parameter. In any case, the contributions of the models should not be counted as if they start from Q2,rnd = 50 %, but from Q2,rnd-bal, which is higher than 50 % for each of the models presented in Table 2.
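Because the balanced-model random accuracy depends only on the class fractions, a reported Q2,rnd-bal value can be inverted to see how imbalanced the experimental set must be. The sketch below assumes the form Q2,rnd-bal = 100 [f² + (1 − f)²], with f the fraction of the smaller class; `minority_fraction` is a hypothetical helper, and the resulting fractions are our own computation, not values reported in Table 2.

```python
import math


def minority_fraction(q2_rnd_bal):
    """Fraction of the smaller class implied by Q2,rnd-bal (in %).

    Solves 100 * (f**2 + (1 - f)**2) = q2_rnd_bal for f in [0, 0.5].
    """
    if not 50.0 <= q2_rnd_bal <= 100.0:
        raise ValueError("Q2,rnd-bal must lie between 50 % and 100 %")
    return 0.5 * (1.0 - math.sqrt(2.0 * q2_rnd_bal / 100.0 - 1.0))


for value in (54.0, 64.0):
    print(value, round(minority_fraction(value), 2))
```

Under this assumption, random accuracies of 54 % and 64 % correspond to a smaller class that makes up roughly 36 % and 24 % of the data set, respectively.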
Analyses given in Table 3 include seven data sets from different versions of two methods (SPLIT [10,11] and TopPred_G [13]) developed for prediction of secondary structure of membrane proteins.
The differences ΔQ2 between real and random accuracies for methods and data sets from Table 3 are in the range between 24.5 % and 35 %, and are significantly lower than

Analysis of Real and Random Accuracy of Classification Quantitative Structure-Activity Models
Quantitative structure-property / activity classification models have often been developed and applied in different subfields of chemistry, such as drug design and environmental, physical or materials chemistry. In Table 4 we give a set of two-state classification models from drug design [14-18] and one from environmental science. [1] One can see from Table 4 that the average contribution of the developed models (i.e. computational methods) over the level of the most probable random accuracy (Q2,rnd), measured by the ΔQ2 parameter, is higher than for the previous models related to the structure of membrane proteins (Table 3, where the average of ΔQ2 values is 30.2 %), and ranges from 26.9 % for model no. 6 to 47.2 % for model no. 12, with an average of 35.6 %. The main reason for the higher ΔQ2 values could be ascribed to the possibility of selecting more balanced data sets, with much closer numbers of elements of classes (high or low activities), in the field of QSAR modeling compared with membrane protein data sets, in which the total number of elements of structure (class) U is considerably larger than that of class M (compare p + u and n + o values in Tables 2 and 3). This imbalance in the numbers of M and U secondary structure states is defined by the length and by the nature of membrane protein sequences, which (usually) contain more U than M secondary structure states, and in creating a data set only complete sequences can be selected (i.e. we cannot take only a part of a sequence into the data set, but only the sequence as a whole).
For more balanced data sets Q2,rnd decreases, and from [Eq. (6)] it follows that ΔQ2 will increase. This is also confirmed by the values of Q2,rnd-bal from Table 4, a parameter calculated by [Eq. (5)] from the squares of the frequencies of class 1 and class 2, which is lowest and equal to 50 % if the frequencies of both classes are equal. Finally, the average of Q2,rnd-bal values from Table 4 is 53 %, and is lower than the Q2,rnd-bal averages for the membrane protein data sets.

Table 4. Real and random classification accuracy and their differences (all in %) of data sets of compounds used for development of quantitative structure-activity two-state classification models.

It should be stressed that this comparative analysis of the magnitudes of ΔQ2 parameters for different models does not suggest anything about the level of significance of these models (per se). Namely, the ΔQ2 parameter, calculated as the difference of two parameters (Q2,rnd and Q2), will be more significant if each of the two quality parameters used for its calculation is more significant. The significance of a model (and also the significance of model quality parameters) is primarily defined by the relation between the size of the data sets (i.e. the number of elements) used for training and the number of optimized model parameters. Taking this into account, one can conclude that models for prediction of the structure of membrane proteins from Tables 2 and 3, which are based on much larger data sets, seem to be more significant than the QSAR models from Table 4.

CONCLUSION
The presented results show that the accuracy that can be obtained by a random model is determined, to a large extent, (1) by the ratio of the numbers of elements belonging to each of the two classes in the experimental input data (i.e. (p + u) / (n + o)), and (2) by the corresponding ratio of the numbers of elements in the two classes (i.e. (p + o) / (n + u)) estimated or predicted by the model. In both cases, the optimal value is equal to 1, i.e. when both classes are equally populated (balanced). Finally, the balanced model for prediction of classes is the model which estimates or predicts the total numbers of elements in classes to be (almost) the same as those in the experimental data, and only such a model can reach (ideally) a maximal accuracy of 100 %. For analysed models ΔQ2 values were mostly between