Integral Sign Change Problem Check in Quantitative Structure-Activity/Property Relationships: A Tutorial

The sign change problem (SCP) in multivariate Quantitative Structure-Activity/Property Relationships (QSAR/QSPR) is the inconsistency in the direction of association between molecular descriptors and the dependent variable. Sign change is observed when the signs of the elements of the reference vector (the correlation vector for the data set obtained from variable selection) are compared to the signs of the elements of all correlation and regression vectors related to a model. The SCP check in this work, named the Integral SCP (ISCP) check, is established as a general and effective anti-SCP procedure consisting of five check levels. Twelve diverse QSAR/QSPR data sets from the literature were tested, and the performance of data sets, models and descriptors was assessed by qualitative labeling systems. Most data sets and models did not perform satisfactorily, which is discussed in terms of possible data and model remedies.


INTRODUCTION
Quantitative Structure-Activity Relationship (QSAR) and Quantitative Structure-Property Relationship (QSPR) [1-5] models are usually multivariate (rarely univariate) regression equations by which a macroscopic property of interest $y$, in vector form $\mathbf{y}$, usually a measured biological activity (in QSAR) or physico-chemical property (in QSPR) of $n$ chemicals, is modeled from $m \geq 2$ molecular descriptors which form a matrix $\mathbf{X}$. A multivariate regression equation obtained from Multiple Linear Regression (MLR), Partial Least Squares (PLS) regression or another multivariate regression method [1,2,4,6,7] has a general form in which the vector of the predicted property $\hat{\mathbf{y}}$ is calculated from descriptors $x_j$ (vectors $\mathbf{x}_j$), after the regression coefficients $\alpha$ and $\beta_j$ have been determined:

$$\hat{\mathbf{y}} = \alpha + \sum_{j=1}^{m} \beta_j \mathbf{x}_j \qquad (1)$$

Selected independent variables $x_j$, i.e. molecular descriptors, have important features in relation to $y$, such as interpretability, easy generation for future applications, and statistically significant correlation to $y$, which can be expressed via simple linear regression for the $j$-th descriptor:

$$\hat{\mathbf{y}} = \alpha_j + \beta_j \mathbf{x}_j, \qquad j = 1, 2, \ldots, m \qquad (2)$$

The regression coefficients are $\alpha_j = 0$ and $\beta_j = r_j$ when the data are autoscaled. The Pearson correlation coefficient $r_j$ for the $j$-th descriptor is a statistical index which measures the degree and direction of the association of variables $x_j$ and $y$ [10-14]. One can notice for a univariate regression (Equation 2) that $\operatorname{sign}(\beta_j) = \operatorname{sign}(r_j)$, and one naturally expects for a multivariate model (Equation 1) that $\operatorname{sign}(\beta_j) = \operatorname{sign}(r_j)$ as well, i.e., the signs of the regression coefficients from the simple (univariate) and multivariate regression equations are the same for a particular descriptor. In general, it is expected that the signs of all regression and correlation coefficients for the $j$-th descriptor are preserved, i.e., they are equal to that of $r_j$, regardless of the data set used (complete, training, external validation or other set).
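The statement that autoscaling gives a zero intercept and a slope equal to the Pearson correlation coefficient in the univariate regression (Equation 2) can be checked numerically. The following is a minimal sketch using only the Python standard library; the toy data and all function names are hypothetical and are not taken from any data set in this work.

```python
import math

def autoscale(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simple_regression(x, y):
    """Least-squares intercept and slope of y on a single variable x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical toy data (one descriptor, six samples).
x_raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_raw = [9.1, 7.8, 8.2, 5.9, 5.1, 3.0]

r_j = pearson_r(x_raw, y_raw)
alpha_j, beta_j = simple_regression(autoscale(x_raw), autoscale(y_raw))
print(f"r_j = {r_j:.4f}, alpha_j = {alpha_j:.2e}, beta_j = {beta_j:.4f}")
```

For these toy values the slope and the correlation coefficient coincide and share the same (negative) sign, which is the univariate relation that the multivariate coefficients are expected to preserve.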
The sign change problem (SCP), or lack of internal consistency, in multivariate QSAR/QSPR modeling has been reported and discussed [15,16] as the lack of preservation of these signs. In other words, a descriptor undergoes sign changes when the signs of its correlation and regression coefficients for the studied data sets are different from the sign of the respective correlation coefficient $r_j$ for the data set that was obtained from variable selection (the reference data set). A multivariate regression model is considered free of SCP when there is no sign change for any of its descriptors. The essence of the SCP check is to ensure that the $j$-th molecular descriptor $x_j$ in a QSPR/QSAR model is physically realistic, i.e. that it is based on real properties of pure substances. Descriptors must have a statistically defined direction of correlation in all linear regressions, i.e. the correlation is either positive or negative. Thus, when $x_j$ is increasing, $y$ must either increase or decrease; both trends cannot exist at the same time. This is a natural imperative for chemical problems such as synthesis of new compounds, selection among existing compounds, docking procedures and intermolecular interaction studies. In the period from 2007 to 2009, the author of this study participated in a line of research on QSAR/QSPR model validation [4,14,17-19], during which the sign change problem was identified as a serious obstacle to obtaining chemically realistic regression models. As a natural consequence, a simple SCP check [15] was introduced in 2010 as a tool for rapid SCP detection and elimination, and was then extended to a more advanced SCP check [16] in 2012. [...][27][28] In this work, the previous SCP check [16] is substantially extended to the Integral SCP (ISCP) check. The aim of the ISCP check is to detect false or partially deficient regression models in terms of descriptors' sign changes (in further text: the sign changes), to identify useless or ill-constructed data sets which were used to build such models, and to remedy the models and data sets whenever possible. The ISCP check consists of five levels, which are introduced to form a general and effective anti-SCP tool for cases in which standard model validations do not efficiently identify, eliminate or minimize the sign changes [15]. For this purpose, an ISCP check tutorial is given, twelve examples of diverse QSAR/QSPR models and data sets are tested, and the resulting performance is discussed together with possible anti-SCP remedies.

General Remarks
If a study includes two or more reference regression data sets and models, then the ISCP procedure must be applied to each reference data set and model.
The ISCP check should be performed between variable selection and model validation. Based on the performance of the data set and model of interest in the ISCP procedure, the researcher decides either to proceed to model validation or to go back to variable selection and possible data set modification.
For the ISCP procedure, one should provide the maximum possible number of data sets and models related to the model of interest. If the complete data set or split data were not used, they should be employed to build models, even if these models are not of primary interest. More data sets and models based on the same variable selection give better insight into the structure of the data and the quality of modeling.
Level 1 - Simple SCP Check
ISCP check level 1 was the first SCP check [15,16], carried out for fifty-two QSAR/QSPR data sets. It consists of calculating the correlation vectors of the descriptor - $y$ relationships for all data sets, as well as the regression vectors of the respective regression models, and comparing all these vectors to the correlation vector for the data set that resulted from variable selection. For example, if the model and the data set obtained from variable selection include all samples ($n$ samples), then these are the reference model and the reference data set, respectively, yielding one correlation and one regression vector. A posterior data split into a training set ($n_t$ samples) and an external validation set ($n_e$ samples) produces a new regression vector and correlation vector for the training set, whilst the correlation vector for the external validation set is calculated when the set is sufficiently large (for example, having seven or more samples [15]). Then, all correlation and regression vectors are compared to the reference correlation vector element by element, including the regression vector of the reference model. If, instead, the complete data set was split first and variable selection and model building were then carried out on the training set, then this data set and model are the reference ones, and its correlation vector is the reference vector. All other regression and correlation vectors are compared to this one in terms of the signs of their elements, meaning that a complete comparison is made for each descriptor. The number (count) of sign changes with respect to the reference vector gives the sign change absolute frequency, which has to be zero for a QSAR/QSPR model of acceptable performance.
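The element-by-element comparison described for level 1 can be sketched in a few lines. The example below is only an illustration with hypothetical toy data (two descriptors, eight samples, the first six taken as a training set); it compares correlation vectors only, and the regression vectors of the models would be appended to the same list of vectors in an analogous way.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_vector(columns, y):
    """Correlation vector [r_1, ..., r_m] for a list of descriptor columns."""
    return [pearson_r(col, y) for col in columns]

def sign(value):
    return (value > 0) - (value < 0)

def count_sign_changes(reference, vectors):
    """Sign change absolute frequency per descriptor, relative to the reference vector."""
    counts = [0] * len(reference)
    for vec in vectors:
        for j, (ref, val) in enumerate(zip(reference, vec)):
            if sign(val) != sign(ref):
                counts[j] += 1
    return counts

# Hypothetical complete data set (reference): two descriptors, eight samples.
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [3.0, 2.5, 2.8, 2.2, 2.4, 2.0, 6.0, 7.0]

reference = correlation_vector([x1, x2], y)
# Hypothetical posterior split: the first six samples form the training set.
training = correlation_vector([x1[:6], x2[:6]], y[:6])

counts = count_sign_changes(reference, [training])
print(counts)
```

In this toy case the second descriptor is positively correlated with $y$ in the complete set but negatively correlated in the training set, so its sign change absolute frequency is non-zero and the model would not pass level 1.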
Level 2 - Full SCP Check
ISCP check level 2 has been carried out for three QSPR data sets and described in detail previously [16]. The full SCP check consists of calculating regression vectors for all submodels, where a submodel is any model having two or more (at most $m$) selected descriptors. The obtained vectors are compared to the reference vector in the same way as in the simple SCP check. The idea of this check is to confirm the internal consistency of the model that was obtained from variable selection, the reference model: a consistent model with $m$ descriptors always decomposes into submodels with zero sign change frequency. For $m$ selected descriptors, the numbers of all combinations giving bivariate, trivariate, etc. $l$-variate models up to the final $m$-variate model are binomial coefficients, which can be calculated as $l$-combinations ($l \geq 2$) of $m$ elements, or simply read as the elements of the $(m + 1)$-th row of Pascal's triangle, with the exception that the first two elements of this row must be discarded. Therefore, the number of all tested multivariate models is $2^m - m - 1$, which grows essentially exponentially with $m$, meaning that the chances for sign changes are greatly augmented for complex regression models. Table 1 shows how the number of multivariate submodels, the number of regression coefficients and the number of coefficients per descriptor grow with $m$. It is easy to test all submodels in the case of MLR, but when a more complex regression method is used, carrying out the same computational procedure for all submodels can be tedious. For example, PLS requires determination of the optimal number of latent variables for every submodel.
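The combinatorics of the full SCP check can be illustrated with the standard library alone. The sketch below enumerates the submodels for the four descriptor names of data set B and counts models and coefficients; the closed forms in the comments ($m \cdot 2^{m-1} - m$ coefficients in total and $2^{m-1} - 1$ per descriptor, intercepts excluded) are derived here from the binomial sums and are not copied from Table 1.

```python
from itertools import combinations
from math import comb

def submodel_counts(m):
    """Counts of l-variate submodels (2 <= l <= m) and of their regression coefficients."""
    per_size = {l: comb(m, l) for l in range(2, m + 1)}
    n_models = sum(per_size.values())                    # equals 2**m - m - 1
    n_coeffs = sum(l * c for l, c in per_size.items())   # equals m * 2**(m - 1) - m
    per_descriptor = n_coeffs // m                       # equals 2**(m - 1) - 1
    return per_size, n_models, n_coeffs, per_descriptor

def enumerate_submodels(descriptors):
    """Yield every submodel as a tuple of two or more descriptor names."""
    for l in range(2, len(descriptors) + 1):
        yield from combinations(descriptors, l)

# Descriptor names of data set B; any list of names works the same way.
names = ["EA", "SArea", "LogW", "Xe"]
per_size, n_models, n_coeffs, per_descriptor = submodel_counts(len(names))
submodels = list(enumerate_submodels(names))

print(per_size, n_models, n_coeffs, per_descriptor)
for m in range(2, 9):
    print(m, submodel_counts(m)[1:])
```

Each enumerated tuple would then be passed to the chosen regression method, and the signs of the resulting coefficients compared to the reference vector as in level 1.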

Level 3 - Extended SCP Check
The idea of this new check is to extend the full SCP check to a larger set of descriptors and so obtain a statistically more reliable report on the SCP performance of the model of interest and of the selected descriptors. This goal can be achieved using a descriptor pool (level 3a) or its well-defined subset (level 3b). For this purpose, two descriptor sets are formed at each level: the set of selected descriptors ($\mathbf{X}_S$), and the set containing the pool or its subset with exclusion of the selected descriptors ($\mathbf{X}_P$). Then, the extended SCP check is carried out for each selected descriptor to determine its respective sign change frequency. The extended SCP check for a particular descriptor consists of building MLR models: bivariate models with each variable from $\mathbf{X}_P$, then trivariate models with all combinations of two variables from $\mathbf{X}_P$, etc. For example, using information from Table 1, one can calculate binomial coefficients so that for $\mathbf{X}_P$ with $m_P$ descriptors there will be $m_P$ bivariate models, $m_P (m_P - 1) / 2$ trivariate and $m_P (m_P - 1)(m_P - 2) / 6$ tetravariate models for each selected descriptor from $\mathbf{X}_S$. The descriptor sign change frequency is the count of sign changes in the regression coefficients of a particular selected descriptor, and not of the descriptors from $\mathbf{X}_P$. The complexity of the MLR models in the procedure is determined by limiting factors: a large number $m_P$, excessive calculation time, computational and memory limits, and the impossibility of building MLR models for certain descriptor combinations (high multicollinearity, descriptors with mostly constant values, etc.). The reference model has satisfactory performance in the extended ISCP check when its sign change frequency is very small.
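As an illustration of the bivariate stage of the extended check, the sketch below pairs one selected descriptor with each variable of a small hypothetical pool, fits the two-variable least-squares model in closed form, and counts how often the coefficient of the selected descriptor disagrees in sign with its correlation to $y$. The data, the tolerance `1e-12`, and the function names are assumptions made for this example; higher-order (trivariate, tetravariate) models would need a general least-squares solver.

```python
def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(value):
    return (value > 0) - (value < 0)

def bivariate_coefficient(x_sel, x_pool, y):
    """Coefficient of x_sel in the least-squares model y ~ x_sel + x_pool (with intercept).

    Returns None when the model cannot be built (singular normal equations).
    """
    s, p, t = centered(x_sel), centered(x_pool), centered(y)
    det = dot(s, s) * dot(p, p) - dot(s, p) ** 2
    if abs(det) < 1e-12:
        return None
    return (dot(p, p) * dot(s, t) - dot(s, p) * dot(p, t)) / det

def extended_check_bivariate(x_sel, pool, y):
    """Return (sign change count, number of models built) for one selected descriptor."""
    ref_sign = sign(dot(centered(x_sel), centered(y)))  # sign of r_j in the reference set
    changes = built = 0
    for x_pool in pool:
        beta = bivariate_coefficient(x_sel, x_pool, y)
        if beta is None:
            continue  # model impossible to build, e.g. constant pool variable
        built += 1
        if sign(beta) != ref_sign:
            changes += 1
    return changes, built

# Hypothetical toy data: one selected descriptor and three pool variables.
y = [1.0, 2.5, 2.5, 4.0, 5.0]
x_sel = [1.0, 2.0, 3.0, 4.0, 5.0]
pool = [
    [2.0, 1.0, 3.0, 1.0, 3.0],   # weakly related variable: sign is preserved
    [2.0, 4.5, 5.5, 8.0, 10.0],  # strongly collinear variable: sign flips
    [1.0, 1.0, 1.0, 1.0, 1.0],   # constant variable: model cannot be built
]

changes, built = extended_check_bivariate(x_sel, pool, y)
print(changes, built, changes / built)
```

The relative sign change frequency here is computed over the models that could actually be built, which is one possible convention and is stated as an assumption.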

Level 4 - Randomization SCP Check
Many published QSAR/QSPR models cannot be well checked at the previous ISCP levels, especially when the number of calculated descriptors is small, data are not available for all ISCP check levels (external validation set, descriptor pool or its subset), the final regression equation is univariate, or only one regression model is published, among other difficulties. The idea of this novel SCP check for a model built for $n$ samples is to overcome these difficulties by using descriptors obtained in a large number of random permutations of

Level 5 - t- and F-tests
In a previous study [15], independent variables in QSAR/QSPR were divided into noise or "trash" variables and descriptors, depending on whether the absolute values of their correlation coefficients with respect to $y$ were smaller or greater than the empirical threshold of 0.30, respectively. To this criterion a new one is added in this work, t- and F-tests for determining the statistical significance of descriptor - $y$ linear relationships (Equation 2), motivated by the fact that several QSAR/QSPR studies do not incorporate it in the usual data set analysis [15]. It consists of two levels: 1) level 5a, the t-test for $\alpha_j$, which is a rarely used test in QSAR/QSPR; and 2) level 5b, the t-test for $\beta_j$ and the F-test, which are used somewhat more often in QSAR/QSPR. Although mathematically equivalent, t- and F-tests do not always give exactly the same results because of differences in the propagation of calculation errors. In this work, a qualitative labeling system for statistical significance at the 95 % confidence level, based on p-values, was adopted from the GraphPad statistical software [29]. Variables characterized as having a not statistically significant (NSS: p > 0.10) or not quite statistically significant (NQSS: 0.05 < p < 0.10) relationship to $y$ are considered noise variables, whilst variables with a statistically significant (SS: 0.01 < p < 0.05), very statistically significant (VSS: 0.001 < p < 0.01) or extremely statistically significant (ESS: p < 0.001) relationship to $y$ are considered descriptors. At level 5a, a QSAR/QSPR model has poor performance if a significant fraction of its variables fails this test, whilst level 5b is a more rigorous criterion. It is important to emphasize that all models, i.e., the data used for all inspected models, must be checked at level 5. This ISCP check enables identification of noise variables [15], which can undergo sign changes (unstable noise, appearing due to data splitting or modeling) or can be stable (hidden and real noise, with significant and not significant contribution to the model, respectively). ISCP check level 5 can also point out descriptors undergoing sign change due to data splitting (quasi descriptors), whilst descriptors with sign changes (anti descriptors) are identified at all SCP check levels 1-5. After the complete ISCP procedure has been applied, the identified good (real) descriptors are ready to be used in further QSAR/QSPR analysis, and deficient (quasi and anti) descriptors can possibly be remedied by another data split, outlier removal, or descriptor transformation.
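The p-value labeling can be written as a small lookup, together with the standard identities for simple linear regression in which the slope t statistic and the regression F statistic follow from $r_j$ and $n$ (with $F = t^2$). This is a minimal sketch: the p-values are assumed to come from external statistical software, the assignment of boundary values such as p = 0.05 to the less significant class is an assumption (the text leaves the boundaries open), and the function names are hypothetical.

```python
import math

def significance_label(p):
    """Qualitative label for a p-value; boundary values go to the less significant class."""
    if p < 0.001:
        return "ESS"   # extremely statistically significant
    if p < 0.01:
        return "VSS"   # very statistically significant
    if p < 0.05:
        return "SS"    # statistically significant
    if p < 0.10:
        return "NQSS"  # not quite statistically significant
    return "NSS"       # not statistically significant

def is_descriptor(p):
    """True for SS, VSS and ESS variables; NQSS and NSS are treated as noise variables."""
    return significance_label(p) in {"SS", "VSS", "ESS"}

def slope_t_statistic(r, n):
    """t statistic for the slope of a simple linear regression with correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def regression_f_statistic(r, n):
    """F statistic of the same simple linear regression (1 and n - 2 degrees of freedom)."""
    return r * r * (n - 2) / (1.0 - r * r)

# Hypothetical p-values, as they might be returned by statistical software.
labels = [significance_label(p) for p in (0.0004, 0.004, 0.03, 0.07, 0.4)]
print(labels)

# Hypothetical correlation coefficient and sample size.
t_value = slope_t_statistic(0.6, 23)
f_value = regression_f_statistic(0.6, 23)
print(t_value, f_value)
```

The equality of `f_value` and `t_value ** 2` up to rounding reflects the mathematical equivalence of the two tests at level 5b mentioned above.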

Additional Checks
Three additional checks are recommended together with the ISCP procedure. The first is a check for descriptors with very small regression coefficients (Equation 1). Such descriptors make no significant contribution to the multivariate model. When autoscaled data are used, the $\beta_j$ values are scale-independent.
Another very important test is graphical inspection of the bivariate descriptor - $y$ plots for all data sets, to see whether there are problems with variable distribution, outliers and distinct groups of samples, non-linearity, and artificially high correlations, among others [15]. In general, graphical inspection of relationships between variables is no less important than numerical checks [30,31]. The third check is for descriptors whose absolute values of correlation coefficients relative to $y$ are significantly smaller than the threshold of 0.30, regardless of the results from ISCP check levels 5a and 5b. It is a practical aid in QSAR/QSPR research for eliminating falsely relevant descriptors [14,15].

Performance Qualitative Labeling Systems
The ISCP check is a complex tool that involves calculation of several statistical indices for descriptors, models and regression coefficients. Using these indices is not very simple and, therefore, performance qualitative labeling systems are proposed. Because the labeling systems are best explained on concrete examples, i.e. results from the ISCP check levels, they are presented at the appropriate places in the section Results and discussion. The performance qualitative labeling systems can be used in QSAR/QSPR research to compare various models and see which one has the best performance, in the same way as is shown for twelve data sets and models in the present work.

Data Sets and Regression Models
Data Set A
This medium-size QSPR data set was published by Kiralj and Ferreira [17]. It consists of five electronic ($E_e$, $E_{CC}$, $\Delta_{HL}$, $Q_{C2mul}$, $Q_{Omul}$) and three structural ($\sigma_b$, $\sigma_r$, $D_{CC}$) descriptors. Two PLS models were constructed and used to predict $\delta$, the $^{17}$O carbonyl chemical shift in substituted benzaldehydes. A PLS model with 50 samples resulted from variable selection, and another model was based on a posterior data split, with 40 and 10 samples in the training and external validation sets, respectively. The descriptor pool contained 109 variables that were based on chemical knowledge of heteroaromatic compounds. An initial subset of 51 descriptors had absolute values of correlation coefficients for descriptor - $y$ relationships ($|r_j|$, Equation 2) greater than 0.60, which was the basis for ISCP check levels 3a and 3b with 101 and 43 descriptors in the matrix $\mathbf{X}_P$, respectively. The two PLS models were validated previously by various methods [14], and the reference model was inspected by the simple SCP check, in which no sign change was observed [15]. In this work, ISCP check levels 2-5 were applied to data set A and the models, and a more comprehensive analysis of the descriptors and models in terms of sign changes is reported.

Data Set B
This small QSPR data set for 23 samples (polycyclic aromatic hydrocarbons, PAHs, including benzene) was published by Ferreira [32] and consists of four descriptors: electronic (EA), steric (SArea) and topological (Log(W), $X_e$). Two PLS models, model 4 from variable selection and its externally validated analogue with 16 training samples [32], were built with the purpose of predicting the boiling points $T_b$ of PAHs. Model 4 was later checked by the simple and full SCP checks [16], by which no sign change was detected. The descriptor pool consisted of 14 variables generated from chemical knowledge of PAHs, meaning that the matrix $\mathbf{X}_P$ contained 10 descriptors. Therefore, ISCP check levels 3a, 4 and 5 were performed in this work.

Data Set C
This is a QSPR-type (more exactly: LFER, Linear Free Energy Relationship) data set of moderate size (64 samples and five descriptors), which was published by Sprunger et al. [33] and used to build an MLR model (model from Equation 10) [33] to predict $\log P_{x,\mathrm{CTAB/water}}$, the logarithm of the micellar phase-water partition coefficient of diverse solutes. The descriptors were rationally generated according to LFER theory and, therefore, no additional variables existed in the descriptor pool: electronic (E, S), steric (V) and hydrogen bonding properties (A, B) of solutes.

Data Set D
[...] 4 and used to predict logBCF, the logarithm of the bioconcentration factor of diverse nonpolar organic compounds. The reference MLR model (model 2, Equation 7) [34] and its externally validated analogue (Equation 8, with 85 training and 29 external samples) [34] were constructed with very rudimentary validation. The simple and full SCP checks were carried out by Kiralj [16], revealing the presence of sign changes in the models. Other ISCP check levels were applied in this work, but since no data were available for the electrotopological descriptor pool (15 descriptors) [34], ISCP check level 3 could not be performed.

Data Set E
This is a small QSAR-related (more exactly: QSAAR, Quantitative Structure-Activity-Activity Relationship) data set, containing three independent variables: electronic (LUMO) and constitutional ($N_O$) molecular descriptors, and a biological activity (Human liver) for 23 diverse toxic chemicals. It was published by Lessigiarska et al. [35] and used to build an MLR model (model 8) to predict the logarithm of human toxicity (HAP). The model had only rudimentary validation. A data split into 18 training and 5 external validation samples was made by Kiralj and Ferreira [15], and a new MLR model was constructed and validated by the simple SCP check, by which no sign change was detected but a hidden noise variable was identified. The descriptor pool [35] had more than 250 variables, and only a part of it was published. In this work, ISCP check level 3b with only 19 descriptors in $\mathbf{X}_P$, and check levels 4 and 5, were carried out.

Data Set F
This small QSAR data set consists of four electrotopological descriptors of the MEDV type ($x_1$, $x_7$, $x_{29}$, $x_{52}$) for 21 samples (cyclooxygenase-2 inhibitors). It was published by Liu et al. [36] and used to build two MLR models: one based on variable selection (Equation 3), and the other as its externally validated analogue (Equation 4: 15 and 6 samples in the training and external validation sets, respectively) [36]. The models were validated by certain methods and used to predict the 50 % drug activity in logarithmic form, pIC$_{50}$. The descriptor pool of 91 electrotopological descriptors and its subsets were not published and, therefore, ISCP check level 3 was the only level that could not be carried out in this work.

Data Set G
This is a larger QSAR data set for 153 polar narcotics from the phenol class, with 50 % toxic activity against T. pyriformis in logarithmic form (log 1/IGC$_{50}$), as published by Aptula et al. [37] for classification purposes. It contains five molecular descriptors: electronic ($E_{homo}$, $E_{lumo}$) and constitutional ($N_{hdon}$) descriptors, lipophilicity ($\log K_{ow}$) and basicity (p$K_a$). The first MLR model, with certain model validation and modest external validation, was published by Yao et al. [38]. A more rigorous external validation was carried out by Kiralj and Ferreira [14], by building two more models, with 75 and 78 samples in the training and external sets and vice versa (the roles of the sets were exchanged), and validating them with other standard model validations. The simple SCP check for the data set with 75 samples and the respective MLR model [15] revealed the presence of sign changes, as well as problematic descriptors with no statistically significant relationship to the dependent variable. The original descriptor pool, consisting of seven descriptors generated from chemical knowledge of phenols, was not published [37]. In this work, ISCP check levels 2, 4 and 5 were carried out.

Data Set H
This QSAR data set is small, containing four descriptors (steric: $M_{04}$ and $M_{11}$; shadow: $S_6'$; and shadow-structural: $P_{5X}$) for 21 oral progesterones with progestational activity relative to norethisterone (IC$_{50}$), as published by Kiralj and Ferreira [39]. It was used to build a PLS model (model Id) [39], which was validated by certain methods. The original publication [39] contained the descriptor pool (33 descriptors) and its subset related to model Id (15 descriptors). In this work, all the ISCP check levels were carried out, including check levels 3a and 3b with 31 and 11 descriptors in the matrix $\mathbf{X}_P$, respectively.

Data Set I
This is a QSPR data set of moderate size, containing eight simple descriptors ($S_1$, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$, $S_7$, $S_8$), which are indicator variables with values 0 and 1 for the absence and presence of chloro-substituents, respectively, in 62 polychlorinated naphthalenes, as published by Yin et al. [40]. It was employed to build an MLR model (model M1/M2, Equations 2, 3) [40] for prediction of a chromatographic retention index (RI). The model was validated only by leave-one-out cross-validation. In this work, all the ISCP check levels were performed, with the exception of level 3 due to the lack of the descriptor pool. This type of data set, with indicator variables that have only two distinct values, has been discussed previously as not recommendable for QSAR/QSPR studies [14,15]. Data set I is checked in this work because this data set type still appears in the current QSAR/QSPR literature.

Data Set J
This QSPR data set was recently published by Ahmadi [41]. It consists of four descriptors of topological (IC5, LP1) and steric (E1v, RDF125m) nature, which were used to predict the logarithm of the association constant (logK) of 53 macrocycles with the sodium cation, via an MLR model with rudimentary validation. Data set J is the only example in this work in which the reference data set, i.e., the data set obtained from variable selection, was obtained after data splitting of an essential descriptor pool subset. Therefore, the MLR model with 40 training and 13 external validation samples is the reference one, and is tested with the ISCP procedure. The descriptor pool (more than 350 descriptors) was not published [27].

Data Set K
[...] μ4Ab-Σβ2O, μ4Pols) for 232 samples, organic compounds from at least fifteen diverse classes. It was generated by Pérez-Garrido et al. [42] and used to build an MLR model to predict the logarithm of the stability constants (logK) of β-cyclodextrin with diverse chemicals. The authors presented only an externally validated model (185 training and 47 external samples) that was extensively validated. The exact size of the descriptor pool is unclear [28]. Check level 1 for this data set was reported previously [15]. The model for the complete set, i.e., the one resulting from variable selection, was not published and, therefore, it is constructed in the present work and considered as the reference model. The model is tested by all SCP check levels except for levels 3a and 3b, due to the lack of the descriptor pool and its subsets.

Data Set L
This is an even larger QSAR data set, consisting of four descriptors (constitutional: nX, nCaH; topological: CIC0; and electronic: HOMO) for 460 diverse volatile organic chemicals. It was published by Gramatica et al. [43] and used to build an MLR model to predict the logarithm of the rate constant for hydroxyl radical tropospheric degradation of chemicals, -log k(OH). This data set is somewhat similar to data set K: the reference model was not reported by the authors although it was obtained from variable selection, and the published model (model 4) [43] was based on a data split (234 training and 236 external samples) and extensively validated. The reference model was tested previously [15] at ISCP check level 1. In this work, ISCP check levels 2, 4a, 4b, 5a and 5b are performed, with the addition of a new model with switched roles of the split subsets (i.e. 236 training and 234 external samples), as has been done for data set G. The descriptor pool of 1308 descriptors was not published [43].

Computational Procedures
Data were autoscaled prior to any calculation, and then used to reproduce selected PLS and MLR models from the literature. The choice of PLS and MLR is justified by the fact that these are the two most common regression methods employed in QSAR/QSPR [44]. Regression and other statistical analyses, as well as diverse calculations, were done using the programs Pirouette (version 4.0) [45] and Scilab (version 5.4.0) [46], and statistical significance in t- and F-tests was checked with the online software GraphPad [29]. For generating random vectors, the grand function in Scilab was used. The sign changes of the regression coefficients $\beta_{ij}$ of the $j$-th descriptor (Equation 1) in $\nu$ regressions ($i = 1, 2, \ldots, \nu$) were counted by introducing the sign matrix with elements $S_{ij} = \operatorname{sign}(\beta_{ij})$, and then using a simple formula: [...]
Determination of thresholds in the random SCP check, depending mainly on the number of samples $n$, was carried out as illustrated in Figure 1. It is visible that the correlation coefficient $\rho_{rd}$ has an asymptotic-like behavior for each data set, i.e. after a certain region of values of the number of randomized vectors $N_{rv}$ it grows very slowly. For large $n$, it is practically impossible to pass the threshold $\rho_{rd} = 0.40$ without a large expense of computational time and memory. On the other hand, small data sets easily pass the threshold at very small values of $N_{rv}$. More intuitive plots illustrate the nature of ISCP check level 4, such as the sign change count (absolute sign change frequency $N_{SC}$) depending on $N_{rv}$ (Figure 2) and on $\rho_{rd}$ (Figure 3), or the relative sign change frequency ($f_{SC}$) depending on $N_{rv}$ (Figure 4). It is visible that models based on problematic data sets exhibit sign changes even for a small number of randomized vectors, and this trend appears in linear form (Figure 2) and in vertical asymptotic form (Figure 3), independently of the number $n$. In terms of $f_{SC}$, sign changes stabilize only at high values of $N_{rv}$ (Figure 4). Figures 1-4 show no regularity with respect to the number of selected descriptors $m$, which varies from 3 to 8 for data sets A-L. Nine smaller data sets (A-F, H-J) were tested up to $N_{rv}$ = 200,000, whilst for the larger data sets (G, K, L) the maximum values of $N_{rv}$ were smaller due to computational time and memory limits.
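One way to count sign changes from a sign matrix is sketched below. It is not the formula used in the computations of this work; it is an assumed equivalent that compares each element $S_{ij} = \operatorname{sign}(\beta_{ij})$ with the sign of the reference correlation coefficient $r_j$ and returns the absolute ($N_{SC}$) and relative ($f_{SC}$) sign change frequencies per descriptor. The coefficient values and function names are hypothetical.

```python
def sign(value):
    return (value > 0) - (value < 0)

def sign_matrix(betas):
    """S[i][j] = sign(beta_ij) for regression i and descriptor j."""
    return [[sign(b) for b in row] for row in betas]

def sign_change_frequencies(betas, reference_r):
    """Absolute (N_SC) and relative (f_SC) sign change frequency per descriptor."""
    S = sign_matrix(betas)
    ref = [sign(r) for r in reference_r]
    nu = len(S)  # number of regressions
    n_sc = [sum(1 for i in range(nu) if S[i][j] != ref[j]) for j in range(len(ref))]
    f_sc = [count / nu for count in n_sc]
    return n_sc, f_sc

# Hypothetical regression coefficients: four regressions, three descriptors.
betas = [
    [0.5, -0.2, 0.10],
    [0.4, -0.3, -0.05],
    [0.6, 0.1, 0.20],
    [0.3, -0.1, -0.02],
]
# Hypothetical reference correlation vector from variable selection.
reference_r = [0.70, -0.50, 0.35]

n_sc, f_sc = sign_change_frequencies(betas, reference_r)
print(n_sc, f_sc)
```

The same function applies unchanged when the rows are regression vectors obtained with randomized variables at level 4, provided that only the coefficients of the selected descriptors are kept in each row.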

General Considerations
A complete example of carrying out the ISCP procedure is given for data set A in Tables S1-S15 in Supplemental Material. The ISCP check results for this data set are organized in two tabular forms: a summary of data set and model statistics, and a summary of descriptor statistics, as shown in Tables 2 and 3, respectively. Statistics summaries for the other data sets are in Tables S16-S37 in Supplemental Material. In general, data set and model performance in the ISCP procedure should be used to decide about the tested data set, model of interest and descriptors, as is shown in Tables 1-6: either to go to the next step (model validation) or to go back to variable selection and possible data modification.
The data set and model statistics summary consists of three types of statistics for each ISCP check level: model statistics, descriptor statistics, and regression coefficient statistics. Each statistic can be expressed in two equivalent forms: as the relative sign change frequency, i.e., the fraction (sign change count)/(total count) for models, descriptors and regression coefficients, and as the percentage value (given in brackets). Only for ISCP check level 4b is another quantity reported, the value of $\rho_{rd}$ at which sign change starts occurring regularly. Among the three statistics, descriptor statistics gives an overall picture of sign change occurrence, whilst model statistics coincides with regression coefficient statistics for ISCP check levels 3a, 3b and 4a. Zero sign change count in ISCP check levels 1, 2 and 4 is a requirement for real models with satisfactory performance. However, it is practically impossible to expect that no sign change will occur at ISCP levels 3a and 3b and, therefore, it is reasonable to tolerate up to 10 % relative sign change frequency for models with satisfactory performance (Table 4).
The most quantitative measure of sign changes in Table 2 is the regression coefficient statistics, which can be expressed in the form of sign change fractions for each test ($S_1$, $S_2$, $S_{3a}$, $S_{3b}$, $S_{4a}$, $S_{5a}$, $S_{5b}$), whilst $S_{4b}$ is the value of $\rho_{rd}$ from ISCP level 4b. The values of these indices (Table 2) can be used to characterize data set and model performance by introducing, for example, a five-level performance qualitative labeling system (Table 4): excellent, good, acceptable, poor and extremely poor performance, which correspond to scores 5, 4, 3, 2 and 1, respectively.
Descriptor statistics (Table 3) can aid decisions about the data set and model at the descriptor level, such as excluding, replacing, or transforming descriptors, excluding samples or outliers, carrying out a new variable selection, or making a new data split, among other actions. Results from each ISCP check level and from the visual check are taken into account, and the performance of every descriptor as well as of the descriptor set is reported. For descriptor set performance in the visual check, 15 the poorest descriptor performance can be used as a rigorous criterion: problematic (problematic bivariate distribution), acceptable (some changes may improve the distribution), and excellent (no need for change), with scores 1, 2, and 3, respectively. The values of descriptor performance, expressed as percentage sign change frequency for all ISCP check levels (except for level 4b, where two parameters are reported, N rv and ρ rd ), can be used together with the visual check performance to characterize each descriptor by a single score (applying the qualitative labeling system in Table 4). The weight for scores from the less strict tests (ISCP levels 3a, 3b and 5a) and for each of the two level 4b parameters is 0.5; otherwise it is 1 (see Table 4).
For example, the total score for descriptor E e from data set A (Table 3) is calculated as the sum of products (score × weight), to which the visual performance score is added: 5×1 (ISCP 1) + 4×1 (ISCP 2) + 5×0.5 (ISCP 3a) + 4×0.5 (ISCP 3b) + 5×1 (ISCP 4a) + 5×0.5 (ISCP 4b: ρ rd performance) + 5×0.5 (ISCP 4b: N rv performance) + 5×0.5 (ISCP 5a) + 5×1 (ISCP 5b) + 2 (visual check) = 33. When this calculation is carried out for a perfect descriptor (i.e. a descriptor with the best possible performance), one obtains the value 35.5. Relative to this ideal performance, descriptor E e has a relative score of 33/35.5 = 0.93 or 93 %, which can be used to assess the risk of using this descriptor in further modeling. Assuming a new label scheme for the total score (no risk to very low risk: 96 % -100 %; low risk: 86 % -95 %; moderate risk: 76 % -85 %; and high to very high risk: ≤ 75 %), descriptor E e can be characterized as a low risk descriptor. Calculating the risk scores for the other descriptors, one finds that data set A contains one descriptor with no to very low risk, five low risk descriptors, and two moderate risk descriptors (Table 3). In general, decisive action must be taken on all high risk descriptors, and a similar action is probably necessary for moderate risk descriptors, whilst descriptors with lower risk can remain untouched.
Data sets with incomplete ISCP validation are somewhat problematic for descriptor risk assessment. In such cases, the total descriptor score can be expressed as a range between its minimum value (based on results from the ISCP check levels that were performed) and its maximum value (the minimum value augmented with the contributions that a hypothetical, perfect descriptor would receive at the missing check levels). For example, data set D could not be checked at the ISCP levels 3a and 3b due to the lack of the descriptor pool and its subsets; therefore, for descriptor x 15 the minimum score was 29.5 and the maximum (assuming perfect descriptor performance at check levels 3a and 3b) was 34.5. Thus, the relative score range is 83 % -97 %, and its mean (90 % of the maximum possible score) corresponds to a low risk (Table S21 in Supplemental Information).
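The scoring arithmetic described above can be reproduced with a short script. This is only a sketch: the dictionary keys, function names and the rounding of percentages to integers are assumptions made for illustration, while the weights, scores and thresholds are those quoted in the text.

```python
# Weights per ISCP check level, following the worked example for descriptor E e:
# levels 3a, 3b, 5a and both level-4b parameters count 0.5, all other levels 1.
WEIGHTS = {"1": 1.0, "2": 1.0, "3a": 0.5, "3b": 0.5, "4a": 1.0,
           "4b_rho": 0.5, "4b_nrv": 0.5, "5a": 0.5, "5b": 1.0}
BEST_LEVEL_SCORE = 5   # best score at an ISCP level (implied by the maximum of 35.5)
BEST_VISUAL_SCORE = 3  # "excellent" in the visual check


def total_score(level_scores, visual_score):
    """Sum of score x weight over the given ISCP levels, plus the visual check score."""
    return sum(WEIGHTS[level] * s for level, s in level_scores.items()) + visual_score


MAX_SCORE = total_score({level: BEST_LEVEL_SCORE for level in WEIGHTS}, BEST_VISUAL_SCORE)


def risk_label(percent):
    """Risk label for a relative score in percent (rounded to an integer, an assumption)."""
    p = round(percent)
    if p >= 96:
        return "no risk to very low risk"
    if p >= 86:
        return "low risk"
    if p >= 76:
        return "moderate risk"
    return "high to very high risk"


def relative_score_range(min_score, missing_levels):
    """Relative score range in percent when some ISCP levels could not be checked."""
    max_score = min_score + sum(WEIGHTS[level] * BEST_LEVEL_SCORE for level in missing_levels)
    return round(100 * min_score / MAX_SCORE), round(100 * max_score / MAX_SCORE)


# Descriptor E e from data set A, with the level scores listed in the text.
e_e = {"1": 5, "2": 4, "3a": 5, "3b": 4, "4a": 5,
       "4b_rho": 5, "4b_nrv": 5, "5a": 5, "5b": 5}
score = total_score(e_e, visual_score=2)
percent = 100 * score / MAX_SCORE
print(score, MAX_SCORE, round(percent), risk_label(percent))

# Descriptor x 15 from data set D: levels 3a and 3b missing, minimum score 29.5.
low, high = relative_score_range(29.5, ["3a", "3b"])
print(low, high, risk_label((low + high) / 2))
```

Keeping the weights in a single table makes it easy to see which check levels dominate the total and to recompute the range when further check levels are unavailable.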
A summary of all data set and model statistics is given in Table 5, and a summary of all descriptor statistics in Table 6, based on data set and model analyses analogous to those in Table 2 and Table 3, respectively (analyses for data sets B -L are shown in Supplemental Material). It is immediately visible that most data sets could not be tested at the ISCP check levels 3a and 3b due to the lack of descriptor pools or pool subsets, and that in some cases external validation sets were missing. Table 5 reports data set and model performance with parameters S 1 -S 5b in brackets, which agrees well with the model characterization from the visual check of descriptor-y scatterplots (penultimate column). The data set score (last column) is obtained as a sum of products (score × weight) for all ISCP check levels, using the performance qualitative labeling system from Table 4, to which the visual check score is added.
(b) Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
Table 3. Descriptor statistics (a) in terms of sign changes for data set A and its models.
(d) Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5.
(e) Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for use in multivariate modeling. Total risk means the risk of taking into account all selected descriptors.
Data sets can also be characterized by the average descriptor score and the overall risk of using the descriptors, as shown in Table 6, where descriptors are classified according to their risk levels. The average descriptor score is obtained simply from the descriptor scores (for data set A these values are shown in Table 3), and the same five-level risk system can be applied to the average descriptor score, with certain corrections when necessary. These corrections take into account the presence of one or more descriptors with the poorest possible performance, which was not adequately included in the performance qualitative labeling system (Table 4). The poorest possible performance includes: sign change frequency of 75 % -100 % (ISCP check levels 1, 2, 3a, 3b and 4a), extremely small values of N rv (N rv < 25, ISCP check level 4b), and extremely poor t- and F-test performance (no statistical significance in 100 % of cases, ISCP check levels 5a and 5b). For example, the average descriptor score for data set F is 0.86, which would correspond to a low overall risk of using the descriptors. However, there are three descriptors (underlined in Table 6) with the poorest possible performance at the ISCP check level 5b (Table S25 in Supplemental Information), and therefore the estimated overall risk must be shifted by one level towards higher risk, i.e. to moderate risk. The proposed performance qualitative labeling systems and their usage should be understood as an aid to finding the best regression model in a QSAR/QSPR study, and they might be refined in future studies.
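The correction of the overall risk for data set F can be written as a small rule. The sketch below is illustrative only: the function names and the boolean flag are hypothetical, the four labels are those of the score ranges quoted earlier in the text, and rounding the average to an integer percentage is an assumption.

```python
# Risk labels ordered from lowest to highest risk (ranges as quoted in the text).
RISK_LEVELS = ["no risk to very low risk", "low risk",
               "moderate risk", "high to very high risk"]


def risk_index(percent):
    """Position in RISK_LEVELS for a relative score given in percent."""
    p = round(percent)
    if p >= 96:
        return 0
    if p >= 86:
        return 1
    if p >= 76:
        return 2
    return 3


def overall_risk(average_percent, has_poorest_descriptor):
    """Overall risk from the average descriptor score, shifted one level towards
    higher risk when any descriptor shows the poorest possible performance."""
    index = risk_index(average_percent)
    if has_poorest_descriptor:
        index = min(index + 1, len(RISK_LEVELS) - 1)
    return RISK_LEVELS[index]


# Data set F: average descriptor score 0.86 (86 %), with descriptors showing the
# poorest possible performance at ISCP check level 5b.
print(overall_risk(86, has_poorest_descriptor=False))
print(overall_risk(86, has_poorest_descriptor=True))
```

Capping the shift at the last label keeps the rule well defined for data sets that are already in the highest risk class.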

Performance of Data Sets and Models in the ISCP Check
The final scores for data sets and models (Table 5) and for descriptors with the overall risk (Table 6) show that the best performing data sets and models are A, B and H. Among them, data set B has the best performance, with no need for any change in descriptors. Data set H has one moderate risk descriptor (M 11 ) which, together with another one (S 6 '), has the poorest possible performance at the ISCP check level 5a. This deficiency could be repaired, which would refine the published model. Data set A has satisfactory performance at all ISCP check levels except level 2, probably because of the large number of descriptors used (eight) and the bimodal normal distribution of y. 17 The reported reference model could be refined by reducing the number of descriptors and perhaps including a new descriptor, bearing in mind that the model was validated with several standard model validations and additional checks known at the time of its publication. 14,15,17 All other data sets, i.e. C -G and I -L, are problematic, having at least one moderate or high risk descriptor (Table 6), unsatisfactory visual performance and a relatively low score (Table 5). In such cases, it is wise to keep only descriptors that bear no, very low or low risk for regression modeling, add new descriptors via variable selection, and then test the new model by carrying out the complete ISCP check.
It can be noted from the statistics summaries in Tables 5 and 6 that the sign change frequency does not depend on data set size, i.e. on the number of selected descriptors (m) and the number of samples for the reference model (n). The summary of data set and model characterization (Table 5) and the summary of descriptor characterization (Table 6) are very useful diagnostics, because ISCP acts as a model validation which indicates what the next step should be after a model has been constructed, i.e. either further model validation or a new variable selection. High and even moderate risk descriptors in multivariate modeling should be removed, replaced, or possibly modified. One should also inspect the performance at particular ISCP check levels, such as those for data set A (Tables 2 and 3), before making the final decision about the data set, model and descriptors.
Finally, variable selection should end when all selected descriptors show low, very low or no risk for multivariate modeling, and the models have satisfactory performance: excellent performance at all ISCP check levels, with the exception of ISCP checks 3a and 3b, in which the performance should be at least good or acceptable (Table 4).
(a) For the ISCP check levels 1, 2, 3a, 3b, 3c, 4a, 4b, 5a and 5b the regression coefficient performance indices are S 1 , S 2 , S 3a , S 3b , S 4a , S 4b , S 5a , S 5b , respectively, where S 4b is the value of ρ rd at which sign change starts appearing continuously with the increase of the number of random vectors, and the other indices are fractions of regression coefficients of selected descriptors with sign changes.
(b) For the ISCP check levels 1, 2, 3a, 3b, 3c, 4a, 5a and 5b percentage sign change frequencies for descriptors are given in brackets. For the ISCP 4b level the values of ρ rd and N rv at which sign changes start occurring continuously are given in brackets. For both values weight is 0.5, whilst at all other ISCP levels the weight is always 1.
The influence of the number of samples n (in statistics: sample size) on the statistical significance of descriptor-y relationships deserves special attention in QSAR/QSPR. As Capraro 47 says, "given a large enough sample, one would always achieve statistical significance", because the value of a statistical test and its corresponding p-value depend not only on the effect size and the selected level of α, but also on the number n. 48,49 Effect size, as defined by Kenny, 50 is "the measure of the strength of effect as opposed to its p-value". [52][53][54] A standard test statistic for one variable or for the relationship between two variables, such as the t- and F-test statistic, can be expressed as a product of sample size and effect size. 48,49 The Pearson correlation coefficient is independent of data set size and should therefore be reported together with the p-value from statistical significance testing of the relationship between two variables. The effect of data set size on the test statistic and probability is clearly noticeable for large data sets. In this work, the two data sets with the largest n, K (n = 232) and L (n = 460), each possess one descriptor with a statistically significant relationship to y at the ISCP level 5b (t- and F-tests) but with a low absolute value of the respective correlation coefficient: μ 1 Dip2 (data set K) with extreme statistical significance and r = 0.26, and nCaH (data set L) with statistical significance and r = 0.11. Such descriptors should not be used in further modeling, because they help to produce misleading, falsely good models. Although the mentioned values of r are statistically significant, i.e. they differ from zero at the 95 % confidence level, they measure weak associations of the two descriptors with y. The threshold r = 0.3 for moderately strong associations in QSAR studies 14,15 is also recommended in the statistical literature. 48,51,52
(a) Sign change statistics expressed as data set and model performance at all ISCP check levels in a qualitative manner (excellent, negligible for slight negative performance, tolerable, bad and very bad) and in a quantitative manner (indices for each ISCP check level). For the ISCP check levels 1, 2, 3a, 3b, 3c, 4a, 4b, 5a and 5b the indices are S 1 , S 2 , S 3a , S 3b , S 4a , S 4b , S 5a , S 5b , respectively, where S 4b is the value of ρ rd at which sign change starts appearing continuously with the increase of the number of random vectors, and the other indices are fractions of regression coefficients of selected descriptors with sign changes. NA -information not available. The performance qualitative labeling system consists of five levels: excellent, good, acceptable, poor and extremely poor.
(b) Visual check of selected descriptors in all data sets via descriptor-y scatterplots with general description: excellent, acceptable (some or all scatterplots have modest distribution problems), and problematic (some scatterplots are not acceptable for regression).
(c) Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

ISCP and Other Approaches to Treat Sign Changes
PLS and MLR are still the two commonest regression methods in QSAR/QSPR. 44 QSAR/QSPR models built using PLS, MLR and other simpler regression methods can be checked rather effectively for sign changes using the ISCP procedure. Hence, there is no need to abandon these regression methods in favour of more complicated ones which treat sign changes only to a certain extent. Regression methods that employ orthogonalized variables generated from the original descriptors, 55,56 such as PLS, deal successfully with descriptor multicollinearity, but are not effective against sign changes and, besides, yield models that are difficult to interpret. It has been shown previously 15 that sign changes in PLS models originate from multicollinearity and increased model complexity. In this work, the examples of PLS models (A, B and H) also exhibit sign changes. In other words, the use of orthogonalized descriptors does not prevent the appearance of sign changes, which are visible when the model is expressed in terms of the original descriptors.
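As a numerical aside to the sample-size effect noted for data sets K and L, the t statistic for a Pearson correlation coefficient can be computed directly from r and n with the standard relation t = r·sqrt(n − 2)/sqrt(1 − r²). The sketch below assumes this textbook formula rather than the exact tests used in this work; the function name and the third, small-sample case are hypothetical and included only for contrast.

```python
import math


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r from n samples against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# (r, n) pairs: the first two are quoted in the text, the third is a hypothetical
# small data set with the same weak correlation as nCaH.
cases = {
    "mu 1 Dip2 (data set K)": (0.26, 232),
    "nCaH (data set L)": (0.11, 460),
    "hypothetical small set": (0.11, 30),
}

for name, (r, n) in cases.items():
    print(f"{name}: r = {r}, n = {n}, t = {t_statistic(r, n):.2f}")
```

For a few hundred samples the two-sided 5 % critical value of t is close to 1.97, so both quoted descriptors pass the test although their correlations are weak, whereas the same r = 0.11 in a set of 30 samples would be far from significant.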
(a) Descriptors are named in the same way as in the original publications.
(c) Overall risk means the risk of taking into account all selected descriptors, based on the average score. Overall risk is given in square brackets when it must be higher than predicted from the average score, due to the presence of descriptors with the poorest possible performance in the ISCP tests (underlined descriptors).
Modern shrinkage regression methods, 57,58 such as ridge regression and LASSO (Least Absolute Shrinkage and Selection Operator) regression and its variants, deal with sign changes indirectly but do not completely solve the sign change problem. In these methods descriptors with small coefficients are shrunk towards zero or discarded, which enables a partial solution of SCP, because the eliminated descriptors are those that can easily undergo sign changes during regression modeling. In LASSO and its variants, descriptors whose regression coefficients are unstable (i.e. vary widely in size and even change sign) are also discarded. In fact, shrinkage regressions tend to minimize descriptor multicollinearity. Shrinkage methods are rather automated procedures in terms of variable selection. LASSO and its variants are based on preserving regression vector signs with respect to the MLR regression vector as the reference vector (Equation 1, with α = 0), which serves as the first estimate of the regression coefficients, 57,58 and not with respect to the correlation vector from univariate regressions (Equation 2). Besides, the complete procedure is carried out for each data set. Therefore, shrinkage regressions do not preserve regression vector signs with respect to one reference vector, which would characterize the reference data set. Consequently, sign changes may occur in shrinkage regressions. The ISCP introduced in this work can be considered a general anti-SCP tool. A model based on a regression method that partially treats SCP should also be subjected to the ISCP check.
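The basic comparison behind such a check, and the way multicollinearity produces a sign change, can be illustrated for two autoscaled descriptors. The sketch below uses the standard closed-form expression for standardized two-descriptor MLR coefficients; the correlation values are hypothetical and the function names are illustrative, not part of the ISCP procedure as published.

```python
def standardized_coefficients(r1, r2, r12):
    """MLR coefficients for two autoscaled descriptors, from their correlations
    with y (r1, r2) and their mutual correlation (r12)."""
    denom = 1.0 - r12 ** 2
    return ((r1 - r2 * r12) / denom, (r2 - r1 * r12) / denom)


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def sign_change_fraction(reference, coefficients):
    """Fraction of coefficients whose sign differs from the reference vector."""
    changes = sum(1 for r, b in zip(reference, coefficients) if sign(r) != sign(b))
    return changes / len(reference)


# Hypothetical, strongly collinear descriptors (r12 = 0.9), both positively
# correlated with y; the reference vector is the correlation vector.
reference = (0.5, 0.8)
coefficients = standardized_coefficients(0.5, 0.8, 0.9)
print(coefficients)
print(sign_change_fraction(reference, coefficients))
```

With these values the first descriptor receives a negative regression coefficient despite its positive correlation with y, so half of the regression coefficients change sign with respect to the reference vector.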
Spectral-SAR [59][60][61] is a novel, challenging approach in terms of its QSAR/QSPR philosophy and theoretical background, calculation of new goodness-of-fit indices, and model interpretation. Spectral-SAR regression models are Hansch-type equations with hydrophobic, electronic and steric descriptors, which are orthogonalized via the Gram-Schmidt algorithm. The spectral-SAR methodology should be further tested for possible sign changes, which should probably be interpreted somewhat differently than in standard QSAR/QSPR. The same can be said about alert-QSAR, 62 in which residual analysis is employed with the purpose of minimizing the correlation of residuals with descriptors.
ISCP is based on the assumption that relationships between descriptors and y are linear, statistically significant, and sufficiently strong. However, when descriptor-y relations are not linear, and the non-linearity is included in the multivariate model but not in the univariate regressions, sign changes may appear. Catastrophe-QSAR 63 uses Thom's polynomials to model non-linearity in multivariate regression (Equation 1), showing that the sign change in the linear terms of a descriptor is caused by inadequate univariate regressions (Equation 2). Catastrophe-QSAR is an example of a new challenge for sign change treatment and interpretation in non-linear QSAR/QSPR.

CONCLUSION
The five-level integral sign change problem check established in this work can be considered an effective anti-sign change problem methodology, acting as a new model validation that detects sign changes in QSAR/QSPR data sets and models. A detailed tutorial for the complete procedure and accompanying checks was presented. The procedure was applied to twelve QSAR/QSPR data sets and models, and the resulting data and model performance was reported and discussed in terms of data and model remedy. Performance qualitative labeling systems are proposed as an aid to characterize data set and model performance and to simplify human decision at this stage of modeling, i.e. the choice between model validation and a new variable selection with possible data modification. Future research will be directed to further development and refinement of the proposed integral sign change problem check. The descriptor sign change problem is an issue to which insufficient attention is paid in QSAR/QSPR research, which certainly contributes to the generation of statistically false, deficient and poorly predictive regression models.
Supplementary Materials. Supporting information to the paper is enclosed in the electronic version of the article. These data can be found on the website of Croatica Chemica Acta (http://public.carnet.hr/ccacaa).

Example for ISCP: Data set A (Tables S1 -S15)
Table S1. Basic information about data set A.
Table S2. ISCP level 1: Comparing correlation and regression vectors to the reference vector for data set A.

Table S9. ISCP levels 4a and 4b: sign change (SC) and ρ rd statistics*,# for data set A.
Observation based on Table S9: Sign change at the ISCP level 4a is equal to zero in terms of regression coefficients, and the same probably holds for the SC frequency at the ISCP level 4b up to very high values of the random coefficient ρ rd (to achieve this limit, N rv should be of the order of magnitude of millions).
Observation based on Tables S12 and S13: All descriptors are characterized by statistically significant slopes at this ISCP level.

(4), which is the maximum complexity of the multivariate model considered.
Level 3a: the maximum complexity (l value) of the l-variate MLR models considered (11) that could be treated computationally; number of all descriptors excluding the selected descriptors (10).
Level 4a: number of random descriptors (generating 25 random descriptors is sufficient to obtain a correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.41).
Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (in this case, 200,000); the minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (in this case, 0.73).
Levels 5a and 5b: significance level α, the probability threshold (0.05).
Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" only when a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).

* Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" if only a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (in this case, reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (in this case, reported as a lower limit >>0.63, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
** Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for use in multivariate modelling. Total risk means the risk of taking into account all selected descriptors.
## Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

(5), which is the maximum complexity of the multivariate model considered.
Level 4a: number of random descriptors (generating 2500 random descriptors is sufficient to obtain a correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.43).
Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 2500), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.43), or the maximum ρ rd reached if sign change has not been observed.
Levels 5a and 5b: significance level α, the probability threshold (0.05).
Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">" if only a probable region for ρ rd was determined).

Table S19. Descriptor statistics* in terms of sign changes for data set C and its models.
* Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" if only a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 10), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.21), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.53, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
** Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for use in multivariate modelling. Total risk means the risk of taking into account all selected descriptors.
## Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

(5), which is the maximum complexity of the multivariate model considered.
Level 4a: number of random descriptors (generating 8000 random descriptors is sufficient to obtain a correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.40).
Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 500), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.25), or the maximum ρ rd reached if sign change has not been observed.
Levels 5a and 5b: significance level α, the probability threshold (0.05).
Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">" if only a probable region for ρ rd was determined).
Table S21. Descriptor statistics* in terms of sign changes for data set D and its models.
* Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" if only a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 500), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.25), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.43, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
** Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for use in multivariate modelling. Total risk means the risk of taking into account all selected descriptors.
## Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 20), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.43), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.83, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).

# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 500, 5000 and 150000), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.61, 0.73 and 0.81), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.84, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).

# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 10 and 50), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>160,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.13 and 0.15), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.37, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).

## This performance was predicted from plots (Figures 1-4 in the article) based on the fact that ρ rd = 0.40 was not reached up to 150,000 random vectors applied.
### Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

* Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" if only a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously (in this case, 10), or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>200,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously (in this case, 0.25), or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.68, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
** Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for use in multivariate modelling. Total risk means the risk of taking into account all selected descriptors.
## Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

(10), which is the maximum complexity of the multivariate model considered.
Level 4a: number of random descriptors (generating 120000 random descriptors is still not sufficient to obtain a correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.29).
Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (in this case, 120,000); the minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (in this case, 0.29).
Levels 5a and 5b: significance level α, the probability threshold (0.05).
Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with the sign ">>" only when a probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
** This performance was predicted from plots (Figures 1-4 in the article) based on the fact that no sign change has been observed up to 120,000 random vectors applied (memory limit reached).
Table S35. Descriptor statistics* in terms of sign changes for data set K and its models.
## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) #
*Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">>" if only probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>120,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.29, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).

Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">>" only when probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
**This performance was predicted from plots (Figures 1-4 in the article) based on the fact that no sign change has been observed up to 60,000 random vectors applied (memory limit reached).

## 0(0%) ## 0(0%) ## 0(0%) ## 0(0%) #
*Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">>" if only probable region for ρ rd was determined, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
# The minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (reported as a lower limit >>60,000, meaning that the true number of random vectors must be greater than the limit by one or more orders of magnitude).
& The minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (reported as a lower limit >>0.19, meaning that the value of ρ rd must be far from the lower limit, i.e. close to 1).
**Descriptors are characterized as bearing low, moderate and high risk of the sign change problem for the use in multivariate modelling. Total risk means the risk of taking into account all selected descriptors.
## This performance was predicted from plots (Figures 1-4 in the article) based on the fact that no sign change has been observed up to 60,000 random vectors applied (memory limit reached).
### Total score is expressed in terms of the sum of score contributions along the ISCP check levels (rules given in Table 4), and as percentage of the maximum value of 35.5 (given in brackets).

Figure 1. Dependence of the correlation coefficient for randomization (ρ rd) on the number of random vectors (N rv) and number of samples in data sets (n). Threshold ρ rd = 0.40 is marked by a dotted line.

Figure 2. Dependence of the number (absolute frequency) of sign changes (N SC) on the number of random vectors (N rv). The number of sign changes is independent of the number of samples in data sets (n). Four data sets with zero sign change frequencies (A, B, K, L) are not presented.

Figure 3. Dependence of the number (absolute frequency) of sign changes (N SC) on the correlation coefficient for randomization (ρ rd). The number of sign changes is independent of the number of samples in data sets (n). Threshold ρ rd = 0.40 is marked by a dotted line. Four data sets with zero sign change frequencies (A, B, K, L) are not presented.

Figure 4. Dependence of the relative frequency of sign changes (f SC, expressed as percentage) on the number of random vectors (N rv). The relative frequency of sign changes is independent of the number of samples in data sets (n). Four data sets with zero sign change frequencies (A, B, K, L) are not presented.
Data sets*: complete data set (n c = 50); subsets after splitting: training (n t = 40) and external validation (n e = 10) sets
Selected descriptors: 8 (E e , E cc , Q Omul , Δ HL , σ b , σ r , D CC , Q C2mul )
QSPR models: PLS model for the complete data set (n c = 50); proposed PLS model for the training set (n t = 40) with external validation (n e = 10)
Reference data set: complete data set (obtained from variable selection)
Reference model: PLS model for the complete data set (obtained from variable selection)
*Number of samples in the complete, training and external validation set is marked by n c , n t and n e , respectively.

:
Sign change was noticed in this SCP check, i.e. the sign change frequency at the ISCP level 3b is equal to 4.60% in terms of regression coefficients, which is a very small sign change frequency.

: Two (Δ HL and σ b) out of eight descriptors are not characterized by statistically significant intercepts, which still does not indicate significantly bad performance of data set A at this ISCP level (the sign change frequency is only 2/24 = 8.33%).

Table 1 .
Dependence of the number of multivariate submodels, number of regression coefficients and number of regression coefficients per descriptor on the number of selected descriptors m. The total number of submodels for a given value of m is shown as the sum of the numbers of all l-variate models, where the numbers of bivariate, trivariate, tetravariate, etc. models are added sequentially from left to right.

The random descriptors are used in combination with each selected descriptor to build bivariate MLR models. As at the SCP level 3, sign changes are counted for each selected descriptor in all models, the samples' position vector p (its transpose is p T = [1, 2, 3, …, n]). rv - number of random vectors). Each set will be used to test all the selected descriptors. For each set, i.e. for each value of N rv , the correlation coefficient for randomization (ρ rd ), defined as the average of the absolute values of the maximum and minimum correlation coefficients for random descriptor-y relationships, is calculated. Another parameter is calculated for each value of N rv : the relative sign change frequency for all selected descriptors (f SC ). Using a table with values of N rv , f SC and ρ rd , it is possible to identify QSAR/QSPR models with satisfactory performance. If sign change appears at small values of ρ rd or N rv , the model has poor performance, whereas if sign change occurs only at high values of these parameters, the model has satisfactory performance. The level variant 4a is directed at finding the smallest value of N rv , whilst the level variant 4b identifies the smallest value of ρ rd , at which sign change starts occurring constantly. Values ρ rd = 0.40 and N rv = 100 are proposed as reasonable empirical thresholds.
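The level 4 procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the random descriptors are drawn from a standard normal distribution (the source does not fix the generator here), the function names `pearson` and `level4_statistics` are hypothetical, and f SC is normalized by the number of bivariate models (N rv times the number of selected descriptors). The sign of the coefficient of a selected descriptor x in the bivariate model y ~ x + z is obtained from the standard result for autoscaled data, β x ∝ r yx - r yz r xz.

```python
import random
from math import sqrt


def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def level4_statistics(descriptors, y, n_rv, seed=0):
    """Return (rho_rd, n_sc, f_sc) for n_rv random descriptors.

    descriptors: list of selected descriptor columns (assumed layout).
    """
    rng = random.Random(seed)
    n = len(y)
    ref = [pearson(x, y) for x in descriptors]  # reference correlations
    r_random = []
    n_sc = 0
    for _ in range(n_rv):
        # Assumption: Gaussian random descriptor
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r_yz = pearson(z, y)
        r_random.append(r_yz)
        for x, r_yx in zip(descriptors, ref):
            r_xz = pearson(x, z)
            # Numerator of the autoscaled coefficient of x in y ~ x + z;
            # the denominator 1 - r_xz**2 is positive, so the sign agrees.
            beta_num = r_yx - r_yz * r_xz
            if sign(beta_num) != sign(r_yx):
                n_sc += 1
    rho_rd = (abs(min(r_random)) + abs(max(r_random))) / 2
    f_sc = n_sc / (n_rv * len(descriptors))  # assumed normalization
    return rho_rd, n_sc, f_sc


# Hypothetical data: one descriptor identical to y, one weakly related.
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3]
x_strong = list(y)
x_weak = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0, 5.5, 3.5]
rho_rd, n_sc, f_sc = level4_statistics([x_strong, x_weak], y, n_rv=200)
```

A descriptor that correlates perfectly with y can never change sign in this setting, because r yx - r yz r xz reduces to 1 - r yz ², which is positive; weakly correlated descriptors are the ones expected to contribute sign changes as ρ rd grows.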
16 Data split into 44 training and 20 external validation samples was made, and a new MLR model was constructed by Kiralj and Ferreira and inspected by the simple SCP check, 15 and later the full SCP check. 16 Both sign checks have shown that data set C and the respective MLR models were based on sign changes. In this work, ISCP check levels 4 and 5 were carried out. 15, x 25 , x 26 , x 27 , x 36 ) for 114 samples, generated by Qin et al.
An MLR model using the complete data set (53 samples) was built in this work and inspected at the ISCP check levels 1, 2, 4 and 5.

Table 2 .
Data set A and its model statistics (a) in terms of sign changes.
(a) ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors (8), which is the maximum complexity of the multivariate model considered. Level 3a: the maximum complexity (l value) of the l-variate MLR models considered (2) that could be treated computationally; number of all descriptors excluding the selected descriptors (101). Level 3b: the maximum complexity (l value) of the l-variate MLR models considered (4) that could be treated computationally; number of descriptors used for testing (43). Level 4a: number of random descriptors (generating 500 random descriptors is sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (0.43). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (200,000); the minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (0.63). Level 5a and 5b: confidence level α, the probability threshold (0.05).

Table 4 .
Performance qualitative labeling system for all ISCP check levels. (a),(b)

Table 5 .
Summary of all data set and model statistics in terms of sign changes (a)

Table 6 .
Summary of descriptor statistics (a) in terms of risk of sign changes in multivariate modelling.

Observation based on Table S1: No sign change was noticed in this SCP check, i.e. the sign change frequency at the ISCP level 1 is equal to zero for all descriptors.
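The level 1 comparison behind this observation, in which the signs of all correlation and regression vectors are compared element by element with the signs of the reference vector, can be sketched as follows. The numbers are hypothetical and are not taken from Table S1; the function name `sign_change_frequency` and the choice to exclude the reference vector itself from the count are assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)


def sign_change_frequency(reference, vectors):
    """Count elements whose sign differs from the reference vector.

    Returns (changes, total, relative frequency).
    """
    changes = 0
    total = 0
    for vec in vectors:
        for ref_j, v_j in zip(reference, vec):
            total += 1
            if sign(v_j) != sign(ref_j):
                changes += 1
    return changes, total, changes / total


# Hypothetical vectors for three descriptors.
r_c = [0.71, -0.65, 0.58]        # reference: correlation vector, complete set
r_t = [0.69, -0.66, 0.60]        # correlation vector, training set
r_e = [0.80, -0.55, -0.12]       # correlation vector, external validation set
beta_c = [0.30, -0.25, 0.20]     # regression vector, complete set
beta_t = [0.28, -0.27, 0.22]     # regression vector, training set

changes, total, freq = sign_change_frequency(r_c, [r_t, r_e, beta_c, beta_t])
```

In this made-up example only the third descriptor changes sign (in the external validation set), giving one sign change in twelve compared elements.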
*Correlation vectors: r c - for complete data set, r t - for training set, r e - for external validation set. Regression vectors: β c - for complete data set, β t - for training set. The reference vector is the correlation vector r c , typed in bold.
# R 2 - correlation coefficient of multiple determination; Q 2 - cross-validated correlation coefficient; LVs - number of latent variables with corresponding percentage (%) of the original information.
*Bivariate and higher l-variate PLS regression models and their numbers (given in brackets).
# Total SCP count (absolute sign change frequency) in l-variate PLS regression models.
Table S4. ISCP level 2: Final sign change (SC) statistics for data set A.

Observation based on Tables S3 and S4: Sign change was noticed in this SCP check, i.e. the sign change frequency at the ISCP level 2 is equal to 4.23% in terms of regression coefficients, which is not a substantially problematic sign change frequency.
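To put such a percentage in context, the number of submodels and regression coefficients examined at level 2 can be estimated combinatorially. The sketch below assumes that all l-variate submodels from l = 2 up to l = m are built from the m selected descriptors and that only descriptor coefficients (not intercepts) are counted; the function name `submodel_counts` is hypothetical.

```python
from math import comb


def submodel_counts(m, l_min=2):
    """Return (number of submodels, number of regression coefficients,
    number of regression coefficients per descriptor) for m descriptors."""
    n_models = sum(comb(m, l) for l in range(l_min, m + 1))
    n_coeffs = sum(l * comb(m, l) for l in range(l_min, m + 1))
    return n_models, n_coeffs, n_coeffs // m


print(submodel_counts(8))  # (247, 1016, 127)
```

Under these assumptions, m = 8 gives 247 submodels with 1016 regression coefficients, so a frequency of 4.23% would correspond to roughly 43 sign changes; this last figure is an inference from the counts, not a value quoted from the tables.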
*Calculations could not be performed for higher MLR regressions: several trivariate MLR models could not be constructed because of the matrix singularity problem.

Table S6. ISCP level 3a: Final sign change statistics for data set A (bivariate MLR models).
*Total sign change count: 6 in 808 regression coefficients or 0.74%.

Observation based on Tables S5 and S6: Sign change was noticed in this SCP check, i.e. the sign change frequency at the ISCP level 3a is equal to 0.74% in terms of regression coefficients, which is not a negligible sign change frequency.
*Descriptor pool subset contains descriptors whose absolute values of correlation coefficients related to the dependent variable are |r| > 0.60.
# Calculations could not be performed for higher MLR regressions because of time- and memory-consuming problems.

Table S8. ISCP level 3b: Final sign change statistics for data set A (bivariate to tetravariate MLR models).
*Bivariate and higher l-variate MLR regression models and their numbers (given in brackets).
# Total sign change count (absolute sign change frequency) in l-variate MLR regression models: 4887 in 106296 models or 4.60%.
N rv - No. random vectors; r min , r max - the minimum and maximum correlation coefficients for correlations between the random vectors and the dependent variable; ρ rd - the correlation coefficient for randomization, given as the average, ρ rd = (|r min | + |r max |)/2; SC - total sign change (SC) count (absolute SC frequency); %SC - relative SC frequency, defined as the ratio of the absolute SC frequency and the number of random vectors.
# Statistics for the ISCP level 4a are typed in bold black: for ρ rd = 0.43 the SC frequency is equal to zero. Statistics for the ISCP level 4b are typed in bold red: these are not definitive statistics, but statistics indicating that a non-zero SC frequency probably occurs at very high values of ρ rd , certainly at ρ rd > 0.62. Calculations stopped at N rv = 200,000 because of time- and memory-consuming problems.
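Reading the level 4a and 4b parameters off a table with these columns can be sketched as below. The rows are hypothetical and do not reproduce any table of this work; the function name `level4_parameters`, the use of "first row with ρ rd ≥ 0.40" for level 4a, and the interpretation of "continuously" as "non-zero SC count in this and every later row" for level 4b are assumptions.

```python
def level4_parameters(rows, rho_threshold=0.40):
    """rows: list of (n_rv, rho_rd, sc_count), sorted by increasing n_rv.

    Returns (row_4a, row_4b); an entry is None when the condition is
    never met, in which case the last tested row acts as a lower limit.
    """
    # Level 4a: first row at which rho_rd reaches the threshold.
    row_4a = next((r for r in rows if r[1] >= rho_threshold), None)

    # Level 4b: walk backwards to find the start of the final unbroken
    # run of rows with a non-zero sign change count.
    start = None
    for i in range(len(rows) - 1, -1, -1):
        if rows[i][2] > 0:
            start = i
        else:
            break
    row_4b = rows[start] if start is not None else None
    return row_4a, row_4b


# Hypothetical table: (N_rv, rho_rd, SC)
rows = [
    (10, 0.21, 0),
    (20, 0.30, 1),   # isolated sign change, not yet continuous
    (50, 0.35, 0),
    (100, 0.41, 2),
    (500, 0.52, 5),
    (1000, 0.55, 9),
]
row_4a, row_4b = level4_parameters(rows)
```

When no sign change is observed in any row, the function returns None for level 4b, which mirrors the case reported as a lower limit (marked with ">>") for the largest N rv and ρ rd reached.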

Table S10 .
ISCP level 5a: t-Test values for intercepts of univariate descriptor-y relationships for data set A.

Table S11 .
ISCP level 5a: Statistical significance of intercepts of univariate descriptor-y relationships for data set A (t-test*).
*Not acceptable statistical significance is typed in bold.

Table S12 .
ISCP level 5b: t-Test (upper values) and F-test (lower values) for slopes of univariate descriptor-y relationships for data set A (t- and F-tests).

Table S13. ISCP level 5b: Statistical significance of slopes of univariate descriptor-y relationships for data set A (t- and F-tests*).
*Results for the t- and F-tests for data set A are exactly the same for all descriptors.
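The agreement between the t- and F-tests noted in the footnote is expected for univariate regression: with a single descriptor, the F statistic for the regression equals the square of the t statistic for the slope, so both tests give the same p-value. The sketch below illustrates this with hypothetical data; the function name `simple_regression_tests` is an assumption, and only test statistics (not p-values) are computed.

```python
from math import sqrt, isclose


def simple_regression_tests(x, y):
    """Return (t_slope, t_intercept, f_stat) for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_reg = slope * sxy                  # explained sum of squares
    s2 = ss_res / (n - 2)                 # residual variance
    t_slope = slope / sqrt(s2 / sxx)      # level 5b, t-test
    t_intercept = intercept / sqrt(s2 * (1 / n + mx ** 2 / sxx))  # level 5a
    f_stat = ss_reg / s2                  # level 5b, F-test (1, n - 2 d.f.)
    return t_slope, t_intercept, f_stat


# Hypothetical descriptor and property values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
t_slope, t_intercept, f_stat = simple_regression_tests(x, y)
print(isclose(t_slope ** 2, f_stat, rel_tol=1e-9))  # True
```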

Table S14 .
Legend for statistical significance for ISCP levels 5a and 5b.
*Statistical significance levels NSS and NQSS are considered as not acceptable in the ISCP analyses in this work. The qualitative labelling of statistical significance levels is from the GraphPad QuickCalcs software (GraphPad Software, Inc., La Jolla, CA, 2013. Last access on June 23, 2014 at http://www.graphpad.com/quickcalcs/DistMenu.cfm).

Table S15 .
Additional checks for the data set A.

Table S16 .
Data set B and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors ( *

Table S17 .
Descriptor statistics* in terms of sign changes for data set B and its models.

Table S18 .
Data set C and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors

Table S20 .
Data set D and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors

Table S22 .
Data set E and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors (3), which is the maximum complexity of the multivariate model considered. Level 3b: the maximum complexity (l value) of the l-variate MLR models considered (6) that could be treated computationally; number of descriptors used for testing (19). Level 4a: number of random descriptors (generating 20 random descriptors is sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.43). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 20), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.43), or the maximum ρ rd reached if sign change has not been observed. Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">" if only probable region for ρ rd was determined).

Table S23 .
Descriptor statistics* in terms of sign changes for data set E and its models.

Table S24 .
Data set F and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors (4), which is the maximum complexity of the multivariate model considered. Level 4a: number of random descriptors (generating 25 random descriptors is sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.40). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 500), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.61), or the maximum ρ rd reached if sign change has not been observed. Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">" if only probable region for ρ rd was determined).

Table S25 .
Descriptor statistics* in terms of sign changes for data set F and its models.

Table S26 .
Data set G and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (3). Level 2: No. descriptors (5), which is the maximum complexity of the multivariate model considered. Level 4a: number of random descriptors (generating 150,000 random descriptors is still not sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.37). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 10), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.13), or the maximum ρ rd reached if sign change has not been observed. Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">" if only probable region for ρ rd was determined).

Table S27 .
Descriptor statistics* in terms of sign changes for data set G and its models.

Table S30 .
Data set I and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (1). Level 2: No. descriptors (8), which is the maximum complexity of the multivariate model considered. Level 4a: number of random descriptors (generating 2000 random descriptors is sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.41). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 30,000), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.50), or the maximum ρ rd reached if sign change has not been observed. Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">" if only probable region for ρ rd was determined).

Table S31 .
Descriptor statistics* in terms of sign changes for data set I and its models.

Table S32 .
Data set J and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (2). Level 2: No. descriptors (4), which is the maximum complexity of the multivariate model considered. Level 4a: number of random descriptors (generating 50 random descriptors is sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.40). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously (in this case, 10), or the maximum number of bivariate models tested if sign change has not been observed; the minimum ρ rd at which sign change starts occurring continuously (in this case, 0.25), or the maximum ρ rd reached if sign change has not been observed. Level 5a and 5b: confidence level α, the probability threshold (0.05). Reported parameters for all ISCP levels are relative sign change frequencies expressed as ratios and percentages (in brackets), except for level 4b for which the value of ρ rd is reported (marked with sign ">" if only probable region for ρ rd was determined).

Table S33 .
Descriptor statistics* in terms of sign changes for data set J and its models.

Table S34 .
Data set K and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (3). Level 2: No. descriptors

Table S36 .
Data set L and its model statistics # in terms of sign changes. Only parameters (regression coefficients) for the selected descriptors are taken into account; parameters for other (real and random) variables are not of interest; therefore, the models and descriptors have the same statistics in such cases.
# ISCP parameters. Level 1: No. models considered (3). Level 2: No. descriptors (4), which is the maximum complexity of the multivariate model considered. Level 4a: number of random descriptors (generating 60,000 random descriptors is still not sufficient to obtain correlation coefficient with respect to y around 0.40); correlation coefficient for randomization, ρ rd (in this case, 0.19). Level 4b: the minimum number of bivariate models at which sign change starts occurring continuously, or the maximum number of bivariate models tested if sign change has not been observed (in this case, 60,000); the minimum ρ rd at which sign change starts occurring continuously, or the maximum ρ rd reached if sign change has not been observed (in this case, 0.19).

Table S37 .
Descriptor statistics* in terms of sign changes for data set L and its models.