Effect of variable allocation on validation and optimality parameters and on cross-optimization perspectives

We investigated the allocation dependence of validation and design of experiment optimality parameters by modifying the pattern of the independent variables. They were scanned from centre-like to corner-like positions, and fixed design of experiment settings were checked as well. The response variable was modelled with multivariate linear regression. The calculations were performed on simulated data in two- and four-dimensional independent variable spaces and on datasets of power plant and QSAR studies. Our results showed that almost all parameters evaluating validation or design of experiment optimality could be tuned by intentionally increasing the number of samples taken close to the edges of the variable domain. We found strong rank correlation among most of the parameters. This coincides well with the primary aim of design of experiment, namely to obtain validatable models, although design of experiment is performed before model building, or even before the experiments, while validation takes place after model building. Surprisingly, the best subsets of the selected samples for the different parameters overlapped only weakly. We found a reasonable cross-optimization power only for some goodness-of-fit and robustness parameters; the overlap was rather weak concerning predictivity and design of experiment optimality parameters. For the simulated data we calculated the exact integrated error of prediction; its value strongly depended on the error features of the individual subsets and showed minimal cross-optimization power. We may conclude that design of experiment using optimality parameters need not provide the allocations with the best validation parameters in many cases.


Introduction
Modelling is one of the basic activities in science. In order to term a model feasible, we have to follow a conscious path and take several questions into consideration. One is related to the aim of the model: it should be either a pragmatic one providing meaningful practical applications or a didactic one, where the model helps in the interpretation and explanation of a phenomenon. A second condition relates to the data used for modelling: they should entirely span the later application domain or the natural incidence of the phenomenon. The data aspect can be divided into questions related to the objects (termed sampling) and to the selection of the variables. There are several objectives in sampling, such as how the objects span the necessary variable space and applicability domain, how representative they are, what the uncertainty of the recorded values is and what the cost of data accumulation is. From the variable side, we have to find an optimal number of variables to develop a feasible model. In the case of black-box models, the variables need not be interpretable, but in the case of didactic models, or cases where human intuition is necessary for future applications, the variables should carry some interpretable scientific background. A typical case is QSAR modelling, where most of the descriptors have lost their interpretability and can therefore be used only in black-box screening, while didactic models based, e.g., on physicochemical data help to find new active compounds using human intuition. The next tasks in modelling are to select an appropriate mathematical function or modelling method and to define the statistical conditions concerning the error and the estimator. Thereafter, the parameters of the model are obtained, but this is no longer a simple minimization of the estimator function using all data, because the model finally has to be validated. For a scientific model it is mandatory to be validated by quantitative numerical methods, and it has to pass the validation.
In our investigation we focus on the link between two sections of this modelling scheme. The first is a part of the sampling step: how the values of the independent variables are chosen from their domain. In the case of conscious modelling, this is mostly managed by design of experiment. The other part is related to the quantitative part of the validation, where validation parameters are calculated.
The actual setting of the values of the independent variables is often chosen routinely, without detailed consideration. In contrast, trained specialists like to apply well-known schemes fitted primarily to the experimental setup, human resources or budget. In more and more cases the statistical background plays a role as well. When model building is the aim of the experiments, the specialists optimize the performance of the future model prior to the experiments. This can be done by the careful selection of the settings or, mathematically, by defining the allocation of the values of the independent variables.
There are two main tasks in which the decision on the allocation of the values plays a role. The first is performed before the experiments, when we focus on the allocation of the predictor variable values in order to provide feasible models at minimal experimental cost; this is design of experiment (DOE). The other case is to select a representative subset of the samples from an existing (often large) dataset just before model building. Here the task is to develop a model where the interpretation and the validation of the model are done during this model building, with later reparametrization, repetition of measurements or calibration transfer already in mind. The methods for determining optimal allocations need not differ between the two cases. Both affect the last step of modelling, the validation of the model, irrespective of whether internal or external validation will be performed. In the case of the latter, the data are split into training and test sets. There is also a more complicated case where three sets are formed: training, test and validation. Unfortunately, in chemistry the number of data usually does not allow this.
Design of experiment routines are included in several commercial software packages, where the users are tutored in the manuals [1,2]. The simplest case is to use fixed designs, e.g. factorial designs, composite designs or more elaborate ones such as the Box-Behnken design [3,4]. The validation parameters of the models are better in this way than in the case of random or stratified sampling. The more advanced designs define optimality parameters, usually abbreviated with a capital letter. The minimum or the maximum of the optimality parameter is searched over subsets of the full design matrix, where the latter consists of several settings of the predictor variables [1,5-7]. The methods can also be used for representative subset selection, if the design matrix contains existing experimental data. In some cases, the response vector is included as well. The data selected by the DOE are usually at the edges of the predictor variable domain. This means we build a model on data quite far from the later, frequently visited domain of the predictor variables, as emphasized by Kjeldahl and Bro [8].
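To illustrate how such an optimality criterion drives subset selection, the following R sketch performs a brute-force D-optimal search over a small candidate grid. The grid, the subset size and the intercept handling are our own illustrative assumptions; real packages use exchange-type algorithms instead of full enumeration.

```r
# Illustrative brute-force D-optimal subset selection (our own sketch, not the
# cited packages): choose a fixed number of rows of a candidate grid maximizing det(X'X).
candidates <- as.matrix(expand.grid(x1 = seq(0, 1, 0.25), x2 = seq(0, 1, 0.25)))
n_pick <- 6                                     # subset size, an arbitrary choice

best_det  <- -Inf
best_rows <- NULL
for (rows in combn(nrow(candidates), n_pick, simplify = FALSE)) {
  X <- cbind(candidates[rows, , drop = FALSE], 1)   # intercept column appended
  d <- det(crossprod(X))                            # D-criterion: det(X'X)
  if (d > best_det) { best_det <- d; best_rows <- rows }
}
candidates[best_rows, ]                         # the D-optimal allocation of settings
```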
There are several methods designed for choosing a representative subset from a set of existing multivariate data. Some examples based on cluster, density or seed point [9-12], distribution and statistical property [13,14] and interval [15,16] methods can be found in the references. The algorithms often provide a solution for the training-test split of the data. The most well-known method is the Kennard-Stone or CADEX algorithm [17], where a distance-based selection is managed in the common space of the predictor and the response values. The result is a test set where most test points each lie close to one training data point. There are several further algorithms, some applications of which can be found in Refs. [18-21]. Some authors claimed that the different methods of rational test/training splitting provide better validation parameters but that predictivity remains the same [20]. A general rule suggests that the allocations of test and training values should be close to each other and that the training set should be diverse [21]. There are some magic ratios for the test/training size, such as 20% or 50% of the data in the test set. The lack of a general rule is a result of dataset dependence, as stated e.g. in Refs. [22,23].
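A minimal sketch of the Kennard-Stone idea is given below. It works in the predictor space only and ignores the response, which is a simplification of the common X-y space mentioned above; published implementations differ in their details.

```r
# Minimal Kennard-Stone style selection: start from the two most distant points,
# then greedily add the point farthest from the already selected ones (maximin rule).
kennard_stone <- function(X, k) {
  D   <- as.matrix(dist(X))                               # pairwise Euclidean distances
  sel <- as.integer(which(D == max(D), arr.ind = TRUE)[1, ])
  while (length(sel) < k) {
    remaining <- setdiff(seq_len(nrow(X)), sel)
    d_min <- apply(D[remaining, sel, drop = FALSE], 1, min)  # distance to closest selected
    sel <- c(sel, remaining[which.max(d_min)])
  }
  sel                                                     # indices of the selected objects
}

set.seed(1)
X <- matrix(runif(40), ncol = 2)                          # 20 objects in 2D
train_idx <- kennard_stone(X, k = 10)
test_idx  <- setdiff(seq_len(nrow(X)), train_idx)
```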
Once we have selected the training and the test sets, we start the model building. Thereafter, validation of the model is necessary. The QSAR guidelines [24] sort the validation parameters into three categories: goodness of fit (R², RMSE, variance of the fitted parameters, …), robustness (Q², …) and predictivity (Q²F2, …). The first two are calculated on the training set and are therefore called internal validation metrics, while predictivity is calculated on external test sets.
Checking the procedure, there is a selection step where the allocation of the predictor values is optimized in some mathematical way, and there is a final validation step where the model is assessed. The aim of our study was to connect the optimality values used in the first step to the validation parameters used in the final one in numerical experiments. We used a method where not different optimizations were performed, but a latent feature of the allocation process was exploited. As mentioned, design of experiment methods mostly prefer a corner-like arrangement of the predictor values over a fully random arrangement. In the first part of our paper we show some examples of how validation and optimality parameters behave if the setup is varied from a mid-biased to a corner-biased selection of existing data points. In the second part we focus on the cross-optimization power among optimality and validation parameters in the case of biased sampling. Cross-optimization power means in our case that if we optimize our sampling in order to improve one optimality or validation parameter, to what extent this improves another parameter.

Calculation details and data sets
The calculations were performed using scripts written in R [2]. Both simulated and experimental/calculated data were used to check the effect of allocation on validation and optimality parameters in the case of multivariate linear regression.

Simulated data
Here we detail the calculation for the case of two independent variables. For a regression sample, n = 9 pairs of the two independent variables were generated from the [0,1) interval in a biased way. Thereafter, the corresponding y values were calculated according to the model equation (Eq. (1)), where the last term is randomly added white noise with a given σ, e.g., 0.1. After performing the multivariate linear regression, the validation and the optimality parameters were calculated for the set. For each biased sampling step the procedure was repeated 1000 times.
The biasing in the generation of the independent variables was done in a systematic way to produce samples ranging from a close-to-centre allocation to a close-to-corner-like allocation in 2D. The biasing was managed by multipliers within an exponential function. The allocation tuning parameter was set to 15 different values. Fig. 1 shows three scatter plots of around 1000 data points each, where the first is the most centred set, the second is the uniform random case (the 7th setting) and the 15th is a close-to-corner-like situation. These 2D data simulations were repeated with different σ values (0.05, 0.10, 0.20) for the n = 9 case. We also performed calculations for n = 16 and n = 25 in 2D.
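The exact form of our biasing function is not reproduced here; the sketch below only illustrates the idea with an assumed power-type transform that pushes uniform random coordinates towards the corners (or the centre), an assumed linear response surface and homoscedastic white noise.

```r
# Illustrative biased generation of n points in [0,1)^2 and MLR fitting. The
# transform, the response coefficients and the noise level are assumptions; the
# paper's exact biasing function and response model are not reproduced here.
biased_sample <- function(n, bias) {
  u <- matrix(runif(2 * n), ncol = 2)
  # bias > 1 pushes the points towards the corners, bias < 1 towards the centre
  0.5 + sign(u - 0.5) * 0.5 * abs(2 * (u - 0.5))^(1 / bias)
}

simulate_fit <- function(n = 9, bias = 1, sigma = 0.10) {
  X <- biased_sample(n, bias)
  y <- 1 + 2 * X[, 1] + 3 * X[, 2] + rnorm(n, sd = sigma)   # assumed true surface
  lm(y ~ x1 + x2, data = data.frame(x1 = X[, 1], x2 = X[, 2], y = y))
}

set.seed(42)
fits <- replicate(1000, simulate_fit(bias = 3), simplify = FALSE)  # one allocation level
mean(sapply(fits, function(m) summary(m)$r.squared))               # average R2 of the level
```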
We also simulated data for the case of 4 independent variables. These 4D data scanned setups from the centre-like arrangement of data positions through the uniform random case to the corner-like case, where corner of course means the corners of the 4D hypercube. The number of x-y pairs was n = 16 in these calculations.
All calculations were augmented with fixed experimental design allocations; in the n = 9 case a composite design was applied, as can be seen in Fig. 1d. In the case of the n = 16 and n = 25 2D sets, equidistant 4- and 5-level full factorial designs were used as fixed DOE settings. In the 4D case a two-level full factorial fixed design was applied. For all fixed DOE cases we generated 1000 datasets by adding random white noise (Eq. (1)). We should mention that we also performed calculations on datasets with heteroscedastic error, where some realistic functions (e.g. error proportional to the response function) were used. In order to limit the size of our manuscript we do not discuss those results here.
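For reference, the fixed design allocations can be written down directly; the sketch below assumes the nine-point arrangement of Fig. 1d to consist of the corners, edge midpoints and centre of the unit square, and builds the full factorial grids in the same way.

```r
# Fixed design allocations on the unit square / hypercube (exact point placement assumed).
doe_9  <- expand.grid(x1 = c(0, 0.5, 1), x2 = c(0, 0.5, 1))    # corners, edge midpoints, centre
doe_16 <- expand.grid(x1 = seq(0, 1, length.out = 4),
                      x2 = seq(0, 1, length.out = 4))          # 4-level full factorial
doe_25 <- expand.grid(x1 = seq(0, 1, length.out = 5),
                      x2 = seq(0, 1, length.out = 5))          # 5-level full factorial
doe_4d <- expand.grid(rep(list(c(0, 1)), 4))                   # two-level full factorial in 4D
```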
Similar calculations were performed on real datasets to select 'representative' subsets. There was some difference in the biased sampling. First, we created a random x according to the method described earlier and calculated the distance of this random point from the centre of the full dataset. Then we searched for the data point in the real sample whose scaled distance from the sample centre was the most similar and which was located in the same part of the space; the latter was determined by the sign pattern of the centred predictor matrix. This was necessary because the real datasets do not span the space uniformly and there was a very limited number of points in some regions. We had to add this extra criterion to avoid a dimension reduction of the investigated problem.
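The following sketch outlines this matched selection from a real dataset; the scaling of the virtual point and the fallback used when an orthant contains no unselected points are our own assumptions.

```r
# Sketch of the biased subset selection from a real dataset (scaling and matching
# details are assumptions): a virtual biased point is generated, and the not yet
# selected real point with the most similar relative distance from the data centre
# within the same orthant (sign pattern of the centred predictors) is chosen.
select_biased_subset <- function(X, n, bias) {
  Xc     <- scale(X)
  d_real <- sqrt(rowSums(Xc^2)); d_real <- d_real / max(d_real)
  orth   <- apply(sign(Xc), 1, paste, collapse = "")
  p   <- ncol(X)
  sel <- integer(0)
  for (i in seq_len(n)) {
    u <- runif(p)
    v <- sign(u - 0.5) * abs(2 * (u - 0.5))^(1 / bias)   # biased virtual point in [-1,1]^p
    cand <- setdiff(which(orth == paste(sign(v), collapse = "")), sel)
    if (length(cand) == 0) cand <- setdiff(seq_len(nrow(X)), sel)  # fallback for empty orthants
    sel <- c(sel, cand[which.min(abs(d_real[cand] - sqrt(sum(v^2)) / sqrt(p)))])
  }
  sel                            # row indices of the selected 'representative' subset
}
```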

Power plant data
The power plant benchmark set holds nearly 10,000 records of a fully loaded working plant over a range of six years [25]. The dependent variable is the electrical energy output; the independent ones are temperature, pressure, relative humidity and exhaust vacuum. Multivariate linear regression using all variables is adequate for predicting the electrical energy output.

QSAR data
In the paper of Gramatica et al. [26] the EC50 values of 35 triazole or benzotriazole derivatives were available. The authors of that work proposed 3 sets of variables, containing 3 variables each, for linear modelling in which the aforementioned EC50 values are treated as the vector of responses. This gave us three data matrices X_i-y of a given set of variables bound column-wise to y, with i = 1, 2, 3. In all three cases we performed linear modelling and acquired three sets of regression coefficients. Gramatica et al. also calculated molecular descriptors for 369 more compounds for which no experimental responses were available. This gave us three more predictor matrices denoted X_iR, i = 1, 2, 3. Using the three models developed formerly, we could predict these unknown responses for each matrix X_iR, gaining vectors ŷ_iR, i = 1, 2, 3.
Out of the above we acquired three datasets with y-vectors independent of X by averaging the y-values of a given pair of vectors, ŷ_iR and ŷ_jR, and assigning the resulting vector to X_kR (with k different from i and j), producing the three datasets referred to as sets G1, G2 and G3 below.

Parameters
For external validation we used independent test sets with the same sample size as the training set; for example, for n = 9 we used n_test = 9. In the case of the data presented here, we used randomly selected test sets without any preferred allocation pattern. In this way we were able to compare the test results in an unbiased way. In some cases we also checked the use of test sets sampled in a biased manner; as is well known, this increases or decreases the values of the validation parameters in a predictable way. Here we were not interested in 'validating' the models by tuning the allocation of the test sets.
We calculated 41 different validation and optimality parameters, of which the ones shown here are detailed in the Appendix. The first one was related to our data selection method: the average Euclidean distance of the independent variables from the set centre (d_av). Its minimum is close to zero and its maximum is half of the largest diagonal of the hypercube at the given dimension, e.g. 0.707 in 2D. The next ones were basic validation parameters: internal goodness-of-fit measures such as R², RMSE and CCC; internal robustness measures such as the leave-one-out Q² and CCC_loo; external ones such as the Q²F1-Q²F3 family, CCC_test and RMSE_test; and some auxiliary variables such as RSS and TSS, both for training and test sets [23,24]. The other part of the calculated values was connected to experimental design. At first we used some packages of R [2], but we observed some inconsistency in the literature [1,2,7]. Finally, we omitted the redundant parameters, e.g. those where a normalization or a square root was the only difference. We show only 6 optimality parameters out of the initial 10 calculated for the training sets. This reduction was possible since the mathematical relations mentioned above changed the correlation of the variables but did not change the results if rank correlations were used. Using rank correlation was appropriate anyway, given the nonlinear relations among the parameters. For example, we did not calculate R²adj values due to its strict scaling relationship to R². In all cases the variances of the slopes and the intercept were recorded as well. The sum of these variances is the target of the least squares estimation, where we are interested in a model for which the variance of the parameters is the smallest. In the case of our simulated data we were also able to calculate the exact integrated error for all fitted models (IQ, see Appendix), since here we knew both the exact model and the fitted one.
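The sketch below shows how the main validation quantities can be computed for one fitted model; the model object and the test set follow the earlier simulation sketch, and the CCC and Q²-family formulas follow their standard definitions (see also the Appendix fragment at the end).

```r
# Sketch: the main validation quantities for one fitted MLR model ('fit' from lm(),
# 'test' a data.frame holding the test predictors and y); standard definitions assumed.
ccc <- function(y, yhat) {                       # concordance correlation coefficient
  2 * sum((y - mean(y)) * (yhat - mean(yhat))) /
    (sum((y - mean(y))^2) + sum((yhat - mean(yhat))^2) +
       length(y) * (mean(y) - mean(yhat))^2)
}

validation_metrics <- function(fit, test) {
  y     <- model.response(model.frame(fit))
  y_loo <- y - residuals(fit) / (1 - hatvalues(fit))     # leave-one-out predictions for OLS
  pred  <- predict(fit, newdata = test)
  Xc    <- scale(model.matrix(fit)[, -1], scale = FALSE) # centred predictors (intercept dropped)
  c(d_av   = mean(sqrt(rowSums(Xc^2))),
    R2     = summary(fit)$r.squared,
    RMSE   = sqrt(mean(residuals(fit)^2)),
    CCC    = ccc(y, fitted(fit)),
    Q2     = 1 - sum((y - y_loo)^2) / sum((y - mean(y))^2),
    CCCloo = ccc(y, y_loo),
    Q2F1   = 1 - sum((test$y - pred)^2) / sum((test$y - mean(y))^2),
    Q2F2   = 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2),
    Q2F3   = 1 - (sum((test$y - pred)^2) / nrow(test)) / (sum((y - mean(y))^2) / length(y)),
    var_p  = sum(diag(vcov(fit))))               # sum of the fitted parameter variances
}
```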
In most of our figures we show the averages over the 1000 trials at a given level of allocation bias, plotted with respect to d_av. In the rank correlation and best-set overlap calculations the individual data were used without any averaging.

Results and discussion
We studied several systems, of which we selected the 2D, n = 9, σ = 0.10 case to be shown primarily. Here the allocation of the independent variables can be visualized easily, as shown in Fig. 1. Thereafter we discuss the effect of the error, the number of data points and the similarities between the 2D and 4D independent variable cases. Our discussion ends with the real data.

Allocation aspect in the simulated data
The dependence of R² and CCC on d_av, the average distance from the sample centre, is shown in Fig. 2a. For the most centre-like allocation the average R² was 0.461, for the uniform random sampling it was 0.949 and for the most corner-like allocation it was 0.978. The same change was observed for CCC: 0.593-0.973-0.988. Both validation parameters increased monotonously, within the statistical uncertainty, on approaching the corner-like allocations. The two dots in Fig. 2 represent the averaged values of the fixed composite design of experiments; here R² = 0.980 and CCC = 0.990, i.e. the fixed composite design provided the best results. The centre-like allocation can be a typical one only when experimental measurements at the borders of the accessible range are more complicated or expensive than corner-like allocations. The numbers show that it is not easy to validate such a situation. The results of the uniform random allocation were still at the border of the usual limit (R² = 0.95), while the close-to-corner biased sampling gave well-validatable data. The basic approach of a trained analyst, the fixed design, gave the best results. Neither a referee nor an authority would claim that the sampling is biased in the case of a fixed DOE, despite the fact that in reality it is a highly tuned (overoptimistic?) approach.
The different parameters are shown with respect to d_av in Figs. 2 and 3. This choice was arbitrary and based on the generation of the simulated data, because any other value related to the spread of the independent variables might be used here. The total sum of squares or the variances of the predictor values would be adequate as well, but we prefer the average distance. In the case of a squared measure on the x-axis, most of the 15 bias steps would be clustered on the right side of the graph, while using d_av they are distributed more evenly along the x-axis. Nor did we use TSS as the x-axis, because it is a property of the response variable. In the case of the simulated data, where the independent variables had the same weights, TSS depends close to monotonically on the independent variables, but there is a bias due to the added white noise. It is known, and widely used in tuning by allocation, that R² can be improved by increasing TSS in linear modelling, especially in our case, where the error was set homoscedastic.
RMSE, the third internal validation parameter for goodness of fit, did not show any significant dependence on the allocation (Fig. 2b). The data generation was rather smooth and the added white noise did not produce outliers. On the contrary, the sum of the variances of the fitted parameters, var_p, clearly decreased from the centre-like to the corner-like allocations (2.576-0.054-0.052). Here the minimum was not at the most corner-like situation; it was 0.030 at a close-to-corner-like one. The fixed composite design was even better, with var_p = 0.018.
In the case of the robustness parameters, Q² = -0.47 for the centre-like allocation (Fig. 2c). This means that a simple constant model was significantly better in the leave-one-out case, and this allocation gives totally non-robust models. The uniform random sampling provided 0.77, and in the most corner-like arrangement Q² = 0.87. The highest value of this metric was achieved by a close-to-corner allocation, denoted with a cross in Fig. 2, Q² = 0.93. This value was still smaller than that of the composite design, Q² = 0.95. The CCC_loo values behaved similarly; the centre-like allocation provided negative leave-one-out CCC. We mention here that the well-known change in Q² is mostly caused by the increasing TSS, similarly to R².
The external validation parameters show the predictivity of the models. The test set contained n_test = 9 independent data points sampled in a uniform random way. Q²F2 is the equivalent of R² calculated on the test set instead of the training set. In Q²F1 the training mean is used, which makes it a feasible alternative to Q²F2 when the test set is significantly smaller than the training one, as it reduces the uncertainty coming from the weak estimation of the mean on small samples. Q²F2 is preferred by some authors [27].
Q²F3 is a special one, where the total sum of squares (TSS) of the training set is used [28]. Therefore, the values of this metric cannot be tuned by the test allocation. Here the training allocation was tuned and the test was not, therefore the behaviour of Q²F3 simply reflects the training tuning. In Fig. 2d the results are similar to those in the previous figures: the centre-like allocation models performed very weakly, questioning any modelling. The unbiased random training provided Q²F values from 0.88 to 0.90. The best biased sets gave 0.92, 0.89 and 0.95 for the three Q²F metrics, while the fixed composite designs gave 0.92, 0.90 and 0.97.
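For reference, the usual definitions of the three external metrics (cf. Refs. [27,28]), with n and n_test the training and test sizes and ŷ the test-set predictions:

```latex
Q^2_{F1} = 1 - \frac{\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{test}} (y_i - \bar{y}_{train})^2} \qquad
Q^2_{F2} = 1 - \frac{\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{test}} (y_i - \bar{y}_{test})^2} \qquad
Q^2_{F3} = 1 - \frac{\sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2 / n_{test}}{\sum_{j=1}^{n} (y_j - \bar{y}_{train})^2 / n}
```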
The dependence of the optimality parameters on the centre-like to corner-like allocation shift is shown in Fig. 3a. The trends were close to monotonous in all cases except G-optimality, where a close-to-corner-like allocation was the best one. The fixed composite design of experiment was better in the cases of the A-, E- and G-optimality parameters and for the condition number.
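A sketch of how these optimality measures can be computed from a training predictor matrix is given below; the definitions follow common textbook forms (G-optimality is approximated by the maximum leverage of the design points), and the exact set and normalizations used in the cited packages may differ.

```r
# Sketch: design optimality measures for a training predictor matrix X (rows = settings);
# definitions follow common textbook forms, normalizations in the cited packages may differ.
optimality_metrics <- function(X) {
  Xm <- cbind(X, 1)                       # intercept column appended, as in the Appendix
  M  <- crossprod(Xm)                     # information matrix X'X
  ev <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  H  <- Xm %*% solve(M) %*% t(Xm)         # hat matrix of the design points
  c(D  = det(M),                          # D-optimality: maximize det(X'X)
    A  = sum(diag(solve(M))),             # A-optimality: minimize trace((X'X)^-1)
    E  = min(ev),                         # E-optimality: maximize the smallest eigenvalue
    G  = max(diag(H)),                    # G-optimality proxy: maximum leverage of the points
    CN = max(ev) / min(ev))               # condition number of X'X
}
```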
In the case of our simulated data we knew both the exact model and the fitted equation, therefore the predictivity of the model (IQ) over the full data range could be calculated analytically (see Appendix). IQ behaved similarly to the validation parameters (Fig. 3b). The best biased allocation was again the close-to-corner case, while the fixed composite design was even better.
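For the simulated case, IQ can also be checked numerically; the sketch below evaluates the double integral on a fine grid, reusing the assumed true surface and the simulate_fit function of the earlier sketch.

```r
# Sketch: numerical check of IQ, the double integral of (f_model - f_exact)^2 over [0,1]^2,
# on a fine grid (for linear models the analytic form also exists); reuses simulate_fit.
iq_numeric <- function(fit, f_exact, m = 201) {
  g  <- seq(0, 1, length.out = m)
  gr <- expand.grid(x1 = g, x2 = g)
  mean((predict(fit, newdata = gr) - f_exact(gr$x1, gr$x2))^2)  # grid mean ~ integral on unit square
}

f_exact <- function(x1, x2) 1 + 2 * x1 + 3 * x2    # the assumed true surface of the earlier sketch
iq_numeric(simulate_fit(bias = 3), f_exact)
```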
To summarize our results so far, R², CCC and the optimality parameters improved as the allocation moved closer to the corners. In the case of the other values, such as Q², CCC_loo, var_p, IQ, Q²F1, Q²F2 and Q²F3, close-to-corner-like allocations were superior to the simple corner-like ones. In our model calculations only the RMSE was insensitive to the bias introduced into the sampling. Surprisingly, in all cases the fixed composite design performed the best. The purpose of being interested in allocation is generally to get better models, and this needs some plausible explanation or some numerical target to define a better model. For most of the modelling, the term 'efficient estimation' is used and model parameters with the smallest variance are targeted. Since the magnitude of var_p is model- and case-dependent, many people like comparable numbers such as R², where e.g. a limit of 0.95 can be set. For the user, model predictivity is the most important. As can be seen, we may have different target functions to optimize. Generally, we do not optimize them directly but use some other variables to get near the optimum. Design of experiments defines several of these optimality parameters.
The correlation coefficient is an efficient metric for describing the link between two parameters, but if the two parameters are not linearly related, it is better to use rank correlation. Rank correlation works perfectly as long as the changes are monotonic in both variables. The rank correlations between R² and some selected validation and optimality parameters are shown in Fig. 4a; the absolute values of the rank correlations are plotted. Most parameters rank-correlated well with R². The quantities indicated by red dots are the ones that can be calculated purely from X, while the blue ones can only be calculated as functions of the model parameters. R² rank-correlated extremely well with Q² and var_p. The optimality parameters also performed well.
Optimization usually means cross optimization: we optimize our model for a given DOE or validation parameter, but we hope that most of the other parameters improve during the process. According to the reasonable rank correlations, one may think that e.g. a D-optimal solution provides excellent R² values. If we check an algorithm of cross optimization, we may arrive at the conclusion that cross optimization is effective if the best subset belonging to the control variable overlaps with the best subset of the other parameter. This overlap ratio is shown for the best 1% of the parameter pairs in Fig. 4b, where one set is always the R²-best 1% of the allocations. Surprisingly, here the overlap was rather small for most of the variables, especially for the optimality-related ones. Even the uncontrollable RMSE performed better than these, which indicates the strong effect of the random selection of differently erroneous data points.
Feasible cross optimization seemed to be possible only for Q², CCC and CCC_loo. The overlap with var_p was also notable.
The same graphs are shown for the total variance of the model parameters in Fig. 4c-d. The trends and the figures are rather similar. Reasonable rank correlation was observed for most of the parameters, but the overlap between the best sets was now significant only for R², Q², CCC and CCC_loo. The RMSE can again be interpreted as an uncontrollable error factor. The overlap was extremely small with all the external validation parameters and with many optimality ones.
In Fig. 4e-f the graphs are shown for the rank correlations and overlaps with the integrated prediction error, IQ. This quantity cannot be calculated in real cases; it is accessible only for simulated data. The rank correlations were again promising, while the overlaps were more than disappointing, see the vertical scale in Fig. 4f. The overlap was calculated between the best 1% subsets of the two parameters; 0.01 corresponds to statistical independence, and the best value of 0.04 means only a four-fold enrichment. In summary, the prediction error cannot be cross-optimized using our other variables.
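The two comparisons used throughout this section can be summarized in a short sketch: the Spearman rank correlation of two parameter series recorded over the same allocations, and the overlap ratio of their best 1% subsets; whether 'best' means the largest or the smallest value depends on the parameter and is handled here by a flag.

```r
# Sketch: Spearman rank correlation and best-1% overlap between two parameter series
# recorded over the same allocations (e.g. R2 and a D-optimality value for each trial).
best_fraction <- function(x, frac = 0.01, largest = TRUE) {
  k <- max(1, round(frac * length(x)))
  order(x, decreasing = largest)[seq_len(k)]      # indices of the best 'frac' of allocations
}

compare_parameters <- function(a, b, largest_a = TRUE, largest_b = TRUE, frac = 0.01) {
  k <- max(1, round(frac * length(a)))
  c(rank_cor = abs(cor(a, b, method = "spearman")),
    overlap  = length(intersect(best_fraction(a, frac, largest_a),
                                best_fraction(b, frac, largest_b))) / k)
}
```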

Effect of n, σ and dimensionality
The above results call into question the optimistic attitude towards design of experiment and the belief that the general validation parameters are as adequate as they are thought to be. It seems that the integrated prediction error is rather sensitive to the error content of the given set, and this may mask all the basic advantages of tuning the allocation. Therefore, we performed two further sets of calculations with different white noise levels, one with σ = 0.05 and another with σ = 0.25. Here we only summarize our results. The trends in the validation parameter changes with respect to d_av remained the same: the corner-like or close-to-corner-like allocations performed the best within the series, and the composite design provided the overall best results. The rank correlations showed exactly the same preferences, while the overlap of the best sets was as disappointing as in the original case. The small differences seemed to be within the uncertainties of the data and no clear trend was observable in the σ = 0.05-0.1-0.25 series.
The 2D simulations were repeated with n = 16 and n = 25 sample sizes. The results were close to identical to those of the basic setting, with only a few changes. First, increasing the number of data points shifted the best performance towards the corner-like allocation from the close-to-corner-like one. Second, the 4- and 5-level full factorial fixed designs did not perform better than the best allocations of the series. Third, G-optimality performed similarly to the other optimality parameters for the large sample sizes. All other trends, especially the rank correlations and the overlaps of the best subsets, remained unchanged. We still did not find any possibility to cross-optimize IQ.
In the case of 4 independent variables we used 16 sample points. The best performance within the series shifted again to the corner-like arrangement, but here the two-level full factorial fixed design was superior in all cases. The rank correlations showed the same trends and cross optimization seemed possible only within the Q², R², var_p, CCC, CCC_loo group.

Power plant dataset
The dimensionality of this dataset was the same as in our last simulated case, but here there was some correlation among the independent variables and some parts of the variable field did not contain any data. It was also not possible to calculate IQ or to apply a fixed design, since the data accumulation was not a designed experiment but the recording of an operation. The generic trends in the validation and optimality parameters with respect to d_av were similar to the simulated 4D data. In Fig. 5a and c the rank correlations for R² and var_p are rather similar to those observed on our basic simulated data; the values were promising. The main difference for this dataset was the better performance of some optimality parameters: e.g. A-, E-optimality and the condition number (CN) had a significant common subset with var_p. There is thus a possibility to cross-optimize these metrics with var_p, unfortunately with only a 0.3 efficiency (Fig. 5b and d). Here the R²-var_p pair was less correlated than in the earlier examples.

QSAR datasets
There was a difference between the QSAR data and the previous ones. Here the independent variables spanned the variable space rather non-uniformly, as can be seen in Fig. 6a. Most of the variables were crowded in different regions, e.g. at small values, at large values or at both ends. These sets were full of x-outliers, several of which could be termed good or bad leverage points. If a large number of data points was used in the sample, the linear regression provided a reasonable model. Here we used a small number of data points per sample: n = 14 points were used to determine the three slopes and one intercept.
In this way a large uncertainty was expected, especially for the robustness and predictivity parameters. As can be seen in Fig. 7, the previously clear trends changed here. Even our biased sampling did not provide monotonic trends for d_av, because several parts of the variable space were rather sparsely populated. R² showed a clear trend of taking higher values for the more corner-like arrangements, but this was partly an artefact of the increasing TSS term. The Q² values showed different results: the centre-to-corner bias caused no clear trend in set G1, it improved the values for set G2, and for set G3 the values were poor in any case. The differences were caused by the different distributions of the three triplets of variables of the respective sets (cf. Fig. 6). The vertical scale of the graphs is [0,1], which was not enough to show the trends of the predictivity parameter Q²F2. In the case of set G1 it improved towards the corner allocation. In the case of set G2 the prediction power was negative for the centre-like allocations and weak values (0.2-0.3) were obtained for the corner-like allocations. For set G3 the trend was the opposite. Fig. 7 clearly showed that 14 data points were not enough for this modelling, but playing with the allocation may mask the problem, e.g. in the case of set G1 at corner-like allocations.
If we check the overlap of the best 1% subsets between the sum of parameter variances and the other values (Fig. 6b), there was a usable overlap for cross optimization only for R², Q², CCC and CCC_loo. The RMSE value, which is an uncontrollable parameter here, showed the best overlap, similarly to the simulated data (cf. Fig. 2), especially for sets G1 and G2. This means that if we have no information on the error of each data point to estimate RMSE a priori for a given allocation, our R²- or Q²-based optimization performs weakly. We note that in the case of clustered independent variables or inhomogeneous data, local modelling (e.g. piece-wise regression) may be used for the clusters.

Conclusions
The allocation of the predictor variables was changed from a centre-like arrangement to corner-like ones in our biased generation of data or subset selection. This change proved to be a good control variable for revealing trends in the validation and optimality parameters. In the case of R² and CCC (goodness-of-fit parameters) the most corner-like allocations provided the best results for the 2D and 4D simulated data. In the case of the sum of the fitted parameter variances (var_p), Q², CCC_loo and Q²F2, the close-to-corner-like allocations were the best, showing a slight improvement over the allocations forced fully into the corners. RMSE did not change with respect to the allocation bias, probably as a result of the perfect homoscedasticity of the error in the simulated data. The trends did not depend on the magnitude of the white noise added to the simulated data. The fixed designs of experiments provided even better validation parameters in the cases where most of the design points were at the edges of the predictor domain. If several midpoints were fixed (the n = 16 and n = 25 cases in 2D), the fixed designs were not the best ones. This means that choosing a low number of fixed design points seems to be feasible, as long as nobody questions how biased these often-suggested allocations are.
In the case of real data, the power plant dataset provided similar results. In the case of the QSAR datasets, relatively small subsets were applied and the performance parameters deteriorated drastically, especially for robustness and prediction power. The inadequacy of the model was the result of the unbalanced distribution of the predictor values. Here the trends were valid only for R² and CCC; for the other parameters, including var_p, the trends strongly depended on the error features of the individual subsets.
To summarize, corner-like and close-to-corner-like arrangements provide good validation parameters in the cases where the predictor matrix fills the variable field evenly, without large gaps.
Most of the optimality parameters showed the same dependence on the bias affecting the allocation. This caused very good rank correlation among most variables, irrespective of whether they were validation or optimality parameters. Surprisingly, this did not translate into good cross-optimization prospects: in most of our datasets, the best 1% of the data optimized for the optimality and validation parameters overlapped weakly. Effective cross optimization seemed possible only among the R², CCC, Q², CCC_loo and var_p parameters. In the case of the simulated data we were able to calculate an integrated error of prediction; this IQ showed rather disappointing overlaps of its best 1% subsets with those of the other parameters.
Allocation is under consideration in two steps of experimental model building. One of them is before the experiment, where design of experiment takes place; the second is at the end of the modelling, during validation. Unfortunately, our investigation of the overlap of the best subsets showed that there is a lack of clear-cut links between the optimality and the validation parameters in practice. Design of experiment using optimality parameters need not provide the allocations with the best validation parameters in all cases.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. Schematic distribution of the independent variables in the generated samples, from a) the centre-like allocation through b) a uniform allocation to c) a corner-like allocation; d) fixed composite design for n = 9 in 2D.

Fig. 2. Dependence of the parameters on the allocation. d_av is the average distance of the data points from the sample centre. The vertical line shows the uniform random sampling. x ∈ R², n = 9, σ = 0.10. a) R² and CCC. b) RMSE and the sum of the fitted parameter variances. c) Q² and CCC_loo. d) External validation parameters for predictivity: Q²F1, Q²F2 and Q²F3.

Fig. 3. Dependence of a) the optimality parameters and b) the integrated prediction error (IQ) on the allocation.

Fig. 4. Left: absolute value of the rank correlations between a given parameter and the other parameters. Parameters indicated in red can be calculated prior to the modelling, while for the blue ones the fitted model has to be known. Right: overlap ratio of the best 1% subsets for a given parameter and the others. Top row: R². Mid row: var_p. Bottom row: integrated prediction error (IQ). (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 5. Power plant dataset. Left: absolute value of the rank correlations between a) R², c) var_p and the other parameters. Parameters denoted in red can be calculated prior to the modelling, while for the blue ones the fitted model has to be known. Right: overlap ratio of the best 1% subsets for b) R² and d) var_p with the other parameters. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 6. a) Distribution of the scaled independent variables for the three QSAR datasets. In the legend, G1, G2 and G3 denote the three sets and the second numbers refer to the three variables of each set. b) Overlap ratio of the best subsets between the sum of the fitted parameter variances (var_p) and the other validation and optimality parameters for the three QSAR datasets.

Fig. 7. The dependence of R², Q² and Q²F2 for the three QSAR datasets (G1, G2 and G3). The x-axis corresponds to the 15 biased allocation sampling steps; from left to right the trend goes from the sample-mean allocation to the corner-like one.

Appendix

The X predictor matrix includes 1-s in the last column to allow an intercept in the regression performed in matrix formalism.

Total sum of squares (+TSS_test).

Concordance correlation coefficient on the training set (+CCC_loo, +CCC_test):
CCC = \frac{2\sum_{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}+\sum_{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^{2}+n(\bar{y}-\bar{\hat{y}})^{2}}

Root mean square error (+RMSE_loo, +RMSE_test).

Sum of the fitted parameter variances: var_p = \sum_i var(p_i), where var(p_i) is the variance of the i-th fitted parameter (slope or intercept here).

Integrated error (only for simulated data, equation for the 2D case):
IQ = \iint (f_{model} - f_{exact})^{2}\, dx_1\, dx_2