Design of Scoring Models for Trustworthy Risk Prediction in Critical Patients

Prediction of an adverse health event (AHE) from objective data is of great importance in clinical practice. A health event is inherently dichotomous as it either happens or does not happen, and in the latter case, it is a favourable health event (FHE). In many clinical applications, it is relevant not only to predict AHEs happening (diagnostic ability) but also to estimate in advance their individual risk of occurrence using ordered multinomial or quantitative scales (prognostic ability) such as probability. An estimated probability of a patient’s outcome is usually preferred to a simpler binary decision rule. However, models cannot be designed by optimising their fit to true individual risk probabilities because the latter are not intrinsically known, nor can they be easily and accurately associated with an individual’s data. Classification models are therefore usually trained on binary outcomes to provide an orderable or quantitative output, which can be dichotomised using a suitable cut-off value. Model discrimination refers to accurate identification of actual outcomes. Model calibration, or goodness of fit, is related to the agreement between predicted probabilities and observed proportions and it is an important aspect to consider in evaluating the prognostic capacity of a risk model (Cook, 2008). Model calibration is independent of discrimination, since there are risk models with good discrimination but poor calibration. A well-calibrated model gives probability values that can be reliably associated with the true individual risk of outcomes. Many models have recently been proposed for diagnostic purposes in a wide range of medical applications and they also provide reliable estimates of individual risk probabilities. Two different approaches have been used to predict patient risk. The first approach is based on estimation of risk probability by sophisticated mathematical and statistical methods, such as logistic regression, the Bayesian rule and artificial neural networks (Dreiseitl & Ohno-Machado, 2002; Fukunaga, 1990; Marshall et al., 1994). Despite their great accuracy, these models are unfortunately not widely used because they are hard to design and call for difficult calculations, often requiring dedicated software and computing knowledge that doctors do not welcome, besides being difficult to incorporate in clinical practice. The second approach creates scoring systems, in which the predictor variables are usually selected and scored subjectively by expert consensus or objectively using statistical methods (den Boer et al., 2005; Higgins et al., 1997; Vincent & Moreno, 2010).

. A probability threshold value, P t , is identified for classification, over which AHE is recognized to occur, that is when P(AHE|x)>P t ; the choice of P t depends on the clinical cost of a wrong decision and influences model classification performance (E.Barbini et al., 2007).2. S c o r e m o d e l s e v a l u a t e r i s k b y a d i s c r e t e s c a l e o f n p o s i t i v e i n t e g e r v a l u e s s i (i = 0, 1, 2, ..., n) which includes zero to represent null risk, but rarely provides a threshold value for classification purposes (Cevenini & P. Barbini, 2010;Vincent & Moreno, 2010).

Discrimination and calibration
Whatever the risk model, its prediction power is generally expressed by discrimination and calibration (Cook, 2008;Diamond, 1992).Discrimination is the capacity of a classification model to correctly distinguish patients who will develop an adverse outcome from patients who will not.It must be optimized during model design by ascertaining that the model learns all the discrimination properties valid for the population, correctly from the training sample and therefore shows similar performance in different samples (generalisation ability) (Dreiseitl & Ohno-Machado, 2002;Vapnik, 1999).Though many criteria exist for evaluating model discrimination capacity (Fukunaga, 1990), sensitivity (SE) and specificity (SP), which measure the fractions of correctly classified sick and healthy patients, respectively, are commonly used for statistical evaluations of binary diagnostic test performance.SE end SP are combined in the receiver operating characteristic (ROC) curve which is a graphic representation of the relationship between the true-positive fraction (TPF = SE) and false-positive fraction (FPF = 1-SP) obtained for all possible choices of P t .The area under the ROC curve (AUC) is the most widely used index of total discrimination capacity in medical applications (Lasko et al., 2005).Calibration, or goodness of fit, represents the agreement between model-predicted and true probabilities of developing the adverse outcome (Hosmer & Lemeshow, 2000).
Retrospective training data only provides dichotomous responses, that is presence or absence of the AHE, so true individual risk probabilities cannot intrinsically be known.
The only way to derive them directly from sample data is to calculate the proportion of AHEs in groups of patients, but this obviously becomes less accurate as group size decreases.Nevertheless, from a health or clinical point of view, it is often useful to have an estimation of the level at which each event happens, using a continuous scale, such as probability.For probability models with dichotomous outcomes, calibration capacity can be evaluated by the Hosmer-Lemeshow (HL) goodness-of-fit test, based on two alternative chi-squared statistics, Ĥ and Ĉ (Hosmer & Lemeshow, 2000).The first formulation compares model-predicted and observed outcome frequencies of fixed deciles of predicted risk probability; the second compares by partitioning observations into ten groups of the same size (the last group can have a slightly different number of cases) and calculating model-predicted frequencies from average group probabilities.The Ĉ-statistic is generally preferred because it avoids empty groups, although it depends heavily on sample size and grouping criterion (den Boer et al., 2005).The HL test cannot really be applied to models with discrete outputs, such as score systems, because group sizes should themselves be adjusted on the basis of discrete values (Finazzi et al., 2011).
Calibration can be improved, without changing discrimination capacity, by suitable monotonic mathematical transformations of model predicted probabilities (Harrell et al., 1996).The mean squared error between model predicted probability and observed binary outcomes is sometimes calculated as a global index of model accuracy, and has been demonstrated to incorporate both discrimination and calibration capacities (Murphy, 1973).

Generalisation, cross-validation and variable selection
Generalisation is defined as the capacity of the model to maintain the same predictive performance on data not used for training, but belonging to the same population.A high generalisation power is of primary importance for predictive models designed on a sample data set of correctly classified cases (training set).Many different procedures, which involve different correctly classified data sets for testing model performance (testing sets), have been used to control model generalisation (Bishop, 1995;Fukunaga, 1990;Vapnik, 1999).A model generalises when differences between errors of testing and training sets are not statistically significant.
Theoretically, the optimal model is the simplest possible model designed on training data and has the highest possible performance on any other equally representative set of testing data.Excessively complex models tend to overfit, i.e. give significantly lower errors on the training data than on the testing data.Overfitting produces data storage rather than learning of prediction rules.Models must be designed to avoid overfitting and improve generalisation through efficient control of the training process.This control often includes suitable techniques for the selection of predictor variables (Guyon & Elisseeff, 2003).
Computer algorithms for properly controlling overfitting are known as cross-validation or rotation techniques and make efficient use of all available data to train and test the model (Vapnik, 1999).The most common type of cross-validation procedure is k-fold, where the original sample is randomly partitioned into k subsamples, one of which is used as testing set and the other k-1 as training set.The process is then repeated k times, changing the testing set each time so that all subsamples are used for testing.A convenient variant, more appropriate in dichotomous classification, selects each subsample to contain approximately the same proportion of cases in the two classes.When k is equal to sample size, n, the procedure is called leave-one-out.One case is tested at a time at each of the n training sessions using n-1 training cases.Resampling methods also exist, and include bootstrap methods that produce different data samples by randomly extracting cases with replacement from the original dataset (Chernick, 2007).
Cross-validation can be used to compare the performance of different predictive modelling procedures and, specifically, to select different sets of predictor variables with the same model.In fact, it is convenient to select the best minimum subset of predictor variables to control generalisation and to avoid information overlap due to correlation between variables.Computer-aided stepwise techniques are usually used to obtain optimal nested subsets of variables for this purpose.To train the model, a variable is entered or removed from the predictor subset on the basis of its contribution to a significant increase in discrimination performance (typically the AUC for dichotomous classification) at each step of the process.The stepwise process stops when no variable satisfies the statistical criterion for inclusion or removal (Guyon & Elisseeff, 2003).

Probability models
We now provide an overview of four approaches for estimating AHE risk probability: the Bayesian classification rule (Lee, 2004), k-nearest neighbour discrimination (Beyer et al., 1999), logistic regression (Dreiseitl & Ohno-Machado, 2002;Hosmer & Lemeshow, 2000), and artificial neural networks (Bishop, 1995;Dreiseitl & Ohno-Machado, 2002).Linear and quadratic discriminant analyses and related Fisher discriminant functions were not considered because they are strictly classification methods, and although they also enable easy derivation of prediction probabilities, they have been demonstrated to be equivalent to Bayesian methods (Fukunaga, 1990).

Bayesian classifiers
Bayes's rule allows the posterior conditional probability of AHEs to be predicted as follows (Lee, 2004) Statistical assumptions are usually made about whether CPDFs have parametric or non parametric structure.In many cases they are assumed to be of the parametric Gaussian type, because this has been proven to provide good discrimination performance, especially if a subset of predictors can be optimally selected from a large set of clinically available variables (E.Barbini et al., 2007;Fukunaga, 1990).

K-nearest neighbour algorithms
The k-nearest neighbour algorithm is among the simplest non parametric methods for assigning patients based on closest training examples in the space of features x (Beyer et al., 1999).Euclidean distance is usually used to measure between-point nearness but other metrics must be introduced if non continuous variables are considered.In our binary classification scheme, the training phase simply consists in partitioning feature space into the two regions or classes, AHE and FHE, based on the positions of training cases.
Each new patient is assigned to the region in which the greatest number of its k neighbours occurs, where k is of course a positive integer.
With two classes, it is convenient to choose an odd k to avoid situations of equality.Typically, the choice of neighbourhood size depends on the type and size of the training set; larger values of k generally reduce the effect of noise on classification at the expense of distinction between classes.Heuristic techniques are used to obtain the optimal value of k.A common choice is to take k equal to the square root of the total number of training cases, but cross-validation methods, such as bootstrap, are often preferred.Although k-nearest neighbour is not strictly a probability method, it has been demonstrated that the fraction of k neighbourhood training cases falling in the AHE region is a good estimate of class-conditional risk probability (Beyer et al., 1999).

Logistic regression
Logistic regression is perhaps the most popular method for estimating risk probabilities in the medical field (Hosmer & Lemeshow, 2000).Logistic regression is a variation of ordinary regression: it belongs to the family of methods called generalized linear models, which include a linear part followed by some associated function.It can be considered a predictive model to use when the dependent response variable is dichotomous and the independent predictor variables are of any type, i.e. continuous, categorical, or both.In d-dimensional feature space, the form of the model is: where "log" is the natural logarithm function, x k (k = 1, 2, …, d) the observation data set and c k (k = 0, 1, 2, ..., d) regression coefficients estimated from training data using maximum likelihood criteria.The inverse of eq. 2 allows the posterior probability of AHE risk, P(AHE|x), to be modelled by a continuous S-shaped curve, even if all predictor variables are categorical.The argument of the logarithm of eq. 2 defines the probability of the outcome event occurring divided by the probability of the event not occurring and is known as the odds ratio.When it is specifically associated with dichotomous predictor variables (risk factors), it is a useful measure of the relative risk due to single risk factors.The reliability of logistic regression results is affected by linear correlations and interaction effects between predictor variables, dependence between error terms, and especially outliers.

Artificial neural networks
Artificial neural networks (or simply neural networks) are mathematical models miming the physiological learning functions of the human brain.They can be designed and trained to create optimal input-output maps of any physical or statistical phenomenon, the relationships of which may even be complex or unknown.They do not require sophisticated statistical hypotheses and account for all possible interrelations between predictor variables in a natural way.In this sense, neural networks can be considered universal approximators (Bishop, 1995).
A preliminary definition of network architecture is needed and should include number of neurons, number of layers, number and type of connections among neurons, type of neuronal activation functions and so on.Learning is the trickiest phase of neural networks: it consists of estimating network parameters (connection weights and activation thresholds) iteratively from training data, to minimize error between actual and model-estimated outputs.Feed-forward neural networks can be designed to directly estimate class-conditional posterior probabilities from predictor variables, without requiring sophisticated statistical hypotheses.Their architecture can be variably complex, but should provide one output neuron with a logistic sigmoid activation function, generating an output between 0 and 1. Neural networks have been demonstrated to provide reliable estimates of classconditional posterior probabilities, such as the AHE risk probability, P(AHE|x), that is (Bishop, 1995): where f is a linear function of n neuron inputs u k (k = 0, 1, 2, ..., n), originating from the outputs of n preceding connected neurons, the parameters of which are connection weights, w k , and neuron activation bias, b.Under-learning can lead to high prediction errors, whereas over-learning can cause overfitting which produces loss of generalisation.Artificial neural network design is therefore anything but simple.Experience is necessary to manipulate heuristic procedures for suitable definition of network architecture and to correctly use iterative numerical training techniques that stop learning when the network begins to overfit.

Direct score model
A scoring model is a formula that assigns points based on known information, in order to predict an unknown future outcome.Many integer score systems have been designed for clinical application to critical patients.The most popular were derived from simplification of any of the above probability models by rounding their parameters to integer values.In particular, many approximate the coefficients of logistic regression models to the nearest integer values (Higgins et al., 1997).We do not dwell on the methodology of these score models here, directing readers to the specialised literature (Vincent & Moreno, 2010).Our main interest is to identify score values that give reliable probabilities of individual risk for prognostic purposes.We discuss on the design of a very simple score system that we call a "direct score model".We also provide a correct and useful statistical interpretation of model prognostic capacity, which can easily be extended to any other score model, even more sophisticated ones (Cevenini & P. Barbini, 2010).

Model design
Only binary predictor variables (risk factors) are used in this score model.The automatic computer procedure and model training is described by the following steps:  All quantitative predictor variables are dichotomised by ROC curve analysis, identifying cut-off values giving equal sensitivity and specificity in relation to adverse outcomes.


Risk factors over or under the cut-off value are coded 0 or 1, depending on whether the risk of AHE decreases or increases, respectively.


The odds ratio of each binary variable is evaluated on the basis of the corresponding confidence interval (CI) (Agresti, 1999): variables with odds ratios not significantly greater than 1 are discarded.


A forward iterative procedure is applied to a data sample (training set) which sums selected binary variables stepwise.


All binary factors are reconsidered at each step, so that multiple selection of one factor gives rise to a multiple integer contribution to the score.


At each step the risk factor providing the highest increment to AUC is included.


Training is stopped when the cumulative increment in A U C o b t a i n e d i n f i v e consecutive steps is less than 1%.This rather soft stopping criterion is used instead of well-established statistical methods (Zhou et al., 2008) to avoid selecting too few predictors, which reduces the possibility of associating an effective probability of AHE with each integer score. A testing dataset of the same size as the training set is used to evaluate model generalisation and to guide conclusive selection of the optimal predictor set.Backward sessions and cross-validation trials cannot be applied because the model is nonparametric.Optimal model selection is carried out by a step-by-step analysis of model prognostic and diagnostic power.At each step w, the conditional probability of the adverse outcome (prognostic risk probability), P w (AHE|S k ), associated with each k th integer score value S k , is estimated from sample data as the ratio of adverse events to the total number of events determining a model score S k .The bias-corrected and accelerated bootstrap method is applied to estimate 95% CIs of P w (AHE|S k ) using 2000 bootstrapped samples.This method makes it possible to infer complex statistics that are difficult or even impossible to represent mathematically and have proven to be theoretically and practically more accurate than other bootstrap methods (Cevenini & P. Barbini, 2010;DiCiccio & Efron, 1996).By graphic inspection of results, the convenience of grouping close scores having large 95% CI because of excessively low data frequencies is considered.The model is chosen to correspond to the iteration providing the largest number of score values or classes having sufficiently narrow and separate 95% CIs with respect to the training data, and at the same time giving testing-data probabilities falling within their 95% CIs.Once the model is created, the score, S, associated with a generic patient is simply given by: where d is the number of predictors in the model, p i the binary value of the i th predictor, and s i , its model-identified associated score.Finally, model discrimination and calibration performance are compared with a logistic regression model designed on the same training data.All statistical procedures are evaluated at a significance level of 95%.

Simulation
Many realistic simulation experiments are carried out to validate and optimise model design.Predictor variables are all taken in binary form, skipping the dichotomisation of continuous variables.In particular, we consider d dichotomised binary predictors, obtaining n = 2 d different combinations of these predictors.Each combination identifies one value of a discrete variable x j = j/n (j = 0, 1, 2, ..., n-1) ranging from 0 to 1.In this way two different beta probability density functions can be associated with adverse and favourable outcomes.
Beta distribution is particularly suitable for representing multinomial phenomena, such as that described by the above n discrete values.In detail, we refer to the discrete probability distribution of a multinomial variable x, the probability values of which are calculated using a beta probability density function.Figure 1 shows an example with two different choices of the beta probability density function shape parameters,  and , to simulate healthy and sick subjects.When the classconditional probability density functions of a two-class classification problem are known, the highest achievable discrimination level is related to the areas of overlap.The lowest error probability of classification, , is given by: where P(C h ) and p(x|C h ) are the prior probability and the class-conditional probability density function for class C h (h = 1, 2), respectively.Prior probability of an adverse outcome, P(AHE), is also known as prevalence, π, and prior probability of favourable outcome, P(FHE), is 1-π.Because of the discrete nature of variable x, in our simulation study, eq. 5 can be approximated as: where  AHE , AHE ,  FHE and FHE are the corresponding shape parameters of beta functions, , related to adverse and favourable outcomes, respectively.Eq. 6 shows that  depends on prevalence and beta parameters.At any iteration w of the above-mentioned stepwise procedure, for any k th integer value of score S k , the simulated "true" conditional risk probability, ) By assuming mutually exclusive x j events, the true class-conditional probabilities are simply obtained from the two simulated beta distributions as the sum of all the discrete probabilities corresponding to the x j values giving the score S k , that is:  Table 1).Training data is not used because the simulation process enables the true probabilities, described above, to be evaluated exactly.All computations are performed using MATLAB code.

Simulation results
The method is illustrated in detail by describing the results of a simulation of the 54 experiments performed.The experiment corresponding to N = 1000, π = 20% and  AHE = 3 is illustrated, because it is similar to an actual clinical condition that will be shown below.
Figure 2 shows the AUC values obtained using the forward selection of model features from simulated training data described above.The stopping criterion arrested the stepwise algorithm at the eleventh step, after 5 out of 12 predictor variables had been selected.In fact, the cumulative increment in AUC was about 0.8% in the last five steps (nos.7-11).The variables are numbered in order of decreasing discrimination power.The most discriminating variable, no. 1, was entered five times (s 1 = 5) in the model, variable no. 2 three times (s 2 = 3) and variables nos.3-5 only once each (s 3-5 = 1).Figure 3 shows the 95% confidence interval of score-associated risk probabilities identified by the bias-corrected and accelerated bootstrap method applied to simulated sample data, from step no. 2 to step no. 9.For each integer score value, the estimated 95% CI is plotted together with the corresponding true probability of AHE (calculated from the beta distribution) and the percentage of cases.The discrimination capacity of the model can be detected at every step by observing the growth of estimated AHE probability with the score, whereas calibration is demonstrated by true risk probabilities (stars), which fall in the corresponding 95% confidence interval of the training data, with the sole exception of certain high scores, where there may be too few cases.Now it is necessary to identify a model that reconciles calibration and discrimination.Excessively simple score models (few steps) have low discrimination power (low AUC) and give inopportunely separated 95% CIs.This can be observed at steps nos. 2 and 3 of Fig. 3 where only three score values (0, 1 and 2) are obtained: CIs between scores have very large gaps, suggesting that finer partitioning of the score axis can be achieved with a larger number of steps.Figure 2 indicates poor discrimination capacity of the scoring system at these initial steps.On the contrary, if too many scores are used, as in steps nos.7-9, the CIs are either too wide or overlap, worsening calibration accuracy.The width of score CIs increases significantly with decreasing observed frequency.For example, at step no. 4, score of 3 has only 16 cases (1.6%) and the corresponding 95% CI is so large that it completely overlaps with the previous score of 2. When the number of cases is even lower, as in step no. 9, where the highest scores of 6 and 7 have four and one cases, respectively, the bootstrap method fails to correctly estimate the CIs and the corresponding scores are totally unreliable in prognostic terms.Hence the need to combine neighbouring scores with too few cases.It is particularly convenient to pool the highest scores, which often have few cases, into a single class having a sufficient data frequency to significantly narrow the 95% CIs.For example, at step no.6 it is useful to pool the last two scores of 4 and 5 into a single class.The pooling of adjacent scores with small data frequency enhances model prognostic reliability, usually with an insignificant reduction in discrimination capacity.From the simulated experiment of Fig. 3, five score classes were identified as a suitable compromise between calibration and discrimination.At any step from no. 6 to no. 9, it is worthwhile combining scores greater than or equal to 4 and leaving the lower scores of 0-3 ungrouped, so as to form five score classes: 0, 1, 2, 3 and ≥ 4. Figure 4 shows the results of pooling the three highest scores of step no.8, which is preferred to the previous steps no.6 and no. 7, because besides having higher discrimination capacity, the pooled class contains a greater number of cases, which narrows the related 95% CI to a greater extent.Just a small gap and a slight overlap can be observed in Fig. 4 between scores of 1 and 2, and between scores of 3 and the class of scores  4, respectively.Step no. 9 and subsequent steps not reported in Fig. 3 are discarded because no improvement can be obtained with respect to step no.8 and CI overlap increases.Indeed, to improve the accuracy of estimates of individual probability of AHE, it could be worthwhile increasing the number of classes, tolerating a greater CI overlap.This can be done by analysing and selecting a step beyond the eighth, where the observed frequency in each class is of course significantly reduced, especially for high scores.Comparison of the results of the three-step model with those of the eight-step pooled model shown in Fig. 4 indicates that the scoring system with five classes effectively fills the gaps between adjacent CIs of the simpler score model.At step no.8, pooling of the highest scores does not significantly influence the discrimination capacity of the scoring system: the estimated AUC decreases slightly from 0.838 (95% CI, 0.781-0.885)to 0.827 (95% CI, 0.777-0.869).
Stepwise logistic regression applied to the training data used for the simulation example, set at statistical significance levels of 95% and 90% to enter and remove variables, respectively, selected the first five binary variables.Figure 5 compares ROC curves of the logistic model and the score model of Fig. 4. The ROC curve of true probability values, calculated from training data using beta distributions and the Bayes theorem, is also plotted (dashed line).AUCs of true data and the logistic model were 0.845 and 0.849, respectively.Figure 5 shows that the score model is close enough to the logistic and true ROC curves.Clearly, discretisation leads to a lower AUC, resulting in underestimation of scoremodel discrimination capacity.In addition, the true and logistic curves are to a large extent within the 95% CI of the score curve.Finally, in real clinical applications, logistic regression often includes continuous variables that may improve discrimination performance.The HL goodness-of-fit test (Hosmer & Lemeshow, 2000) showed good calibration performance of the logistic model (p = 0.751).However, 95% of the training-data errors between model-estimated and true percentage risk probabilities were from about -10.5% (underestimation) and +12.0%(overestimation), revealing similar uncertainty to that of the score model.Table 1 gives the number of score values or classes identified by the same procedure, for each of the 54 simulation experiments.It shows that the number of score classes increases with increasing sample size, prevalence and separation between event classes (decreasing error ).The importance of estimating uncertainty suggests to keep 95% CIs of between-class probabilities separate, or slightly overlapping.This limits the identifiable number of score classes and provides reliable probability estimates.Enlargement and overlapping of 95% CIs and consequent loss of prognostic probability information depends heavily on the data frequency of score values or classes and their rate of AHEs influenced by prevalence.Small samples and/or low prevalence make it necessary to pool neighbouring scores to form classes with a sufficient number of cases to ensure a reliable estimate (narrow CI) of class probabilities.

Low separation
 Simulation experiments suggests grouping scores into classes when frequencies are less than about 3% and 10% of the whole sample for N = 8000 and N = 250, respectively.Only two classes are recognised in the worst condition of minimum sample size (N = 250), minimum prevalence (Π = 5%) and low separation between health events ( AHE = 2).A maximum of seven score-classes is identified in conditions of large sample size (N = 8000), high prevalence and high separation between event classes.Although more score classes could be achieved with greater CI overlap, the cost would be unreliable estimates.
The discrimination of the different simulation experiments is assessed by AUC of true simulated probability calculated using beta functions.Conditions of large overlap between areas of beta functions ( AHE = 2) lead to values of true AUC ranging from 0.72 to 0.75; medium overlap ( AHE = 3) gives AUC values in the range 0.82-0.85and the conditions of greatest separation ( AHE = 5) produce AUCs between 0.92 and 0.95.

Clinical example
The approach was applied to actual clinical data of critical patients in the intensive care unit to evaluate their risk of morbidity after heart surgery.
We used a sample of 1040 adult patients younger than 80 years, who underwent coronary artery bypass grafting and were admitted to the intensive care unit of the Department of Surgery and Bioengineering of Siena University.212 patients developed at least one serious postoperative complication (cardiovascular, respiratory, neurological, renal, infectious or hemorrhagic), corresponding to a morbidity of 20.4% (Cevenini & P. Barbini, 2010, as cited in Cevenini et al., 2007).The data was split randomly into a training and a testing set of the same size (520 cases), with the same number of patients with morbid conditions in each set (106 cases) to avoid misleading bias in the results.Table 2 describes the 15 clinical variables used for score model design, six of which were binary in origin.The other nine continuous variables were dichotomised using cut-off values associated with the point of equal sensitivity and specificity on the respective ROC curves.Three of the resulting 15 binary variables were discarded because their odds ratios of morbidity were not significantly greater than 1.This left a total of 12 variables for training the score model, as in the simulation experiments.This real clinical situation was similar to the simulation experiment with N = 500 and Π = 20% (see Table 1).Consulting Table 1, we expected to develop a score model with 4 or 5 classes, depending on the level of data separation between normal and morbid patients.
Figure 6 shows the stepwise procedure used to select the model variables.After step no.8, AUC values of testing data (dashed line with stars) decreased and diverged from training data AUCs (continuous line with dots).This indicated overfitting that was possible because the criterion used to stop the training procedure was deliberately soft, to allow inclusion of more steps than needed for generalisation.In fact, as previously illustrated in the simulation results, investigation of extra steps can be useful to optimise model prognostic power through score pooling.Steps nos.6, 7 and 8 gave similar prognostic performance, so we chose step no.8, thus obtaining higher discrimination (greater AUC).
A convenient class was formed by pooling scores greater than 3, as shown in Fig. 7.All 95% CIs of adjacent scores or classes were well-separated and all testing score probabilities (stars) fell within their corresponding CIs, thereby ensuring high prognostic reliability of the model.LogCV, used the original continuous variables and the second (LogBV) dichotomised them (see Table 2).The stepwise regression procedure selected ten clinical variables (see Table 2) and provided training-data AUC values of 0.906 (HL test, p = 0.135) and 0.871 (HL test, p = 0.557) for LogCV and LogBV, respectively.Figure 8 compares the ROC curves.The LogCV ROC curve (continuous gray line) showed the greatest discrimination performance, mainly because the model selected many continuous variables (6 out of 10).
Except for the highest specificity values, where the discretisation effect of scoring was more evident, the score model ROC curve (continuous black line) did not differ significantly from that of LogBV (dashed gray line), which was inside the respective 95% CI and close enough to the score-model points.Model scores computed using the testing data gave a ROC curve (dashed black line) not significantly different from the training data curve.Finally, it should be noted that the discrimination performance of logistic models decreased considerably when applied to testing data (ROC curves not reported in Fig. 8): AUCs of logCV and logBV were reduced to 0.879 and 0.826, respectively, thus suggesting a possible overfitting.

Discussion
Many quantitative methods for assessing the health risk of critical patients have been developed in past and recent literature (E.Barbini et al., 2007;den Boer et al., 2005;Vincent & Moreno, 2010).They aim to provide objective and accurate information about patient diagnosis and prognosis.Experience has shown that simplicity of use and effectiveness of implementation are the most important requirements for their success in routine clinical practice.Scoring systems respond well to these requirements because their outcomes are accessible in real time without the use of advanced computational tools, thus allowing decisions to be made quickly and effectively.Many clinical applications can profit from their simplicity.For example, they are often used to suggest alternative treatments and organize intensive care resources, where surveillance of vital functions is the primary goal.
Other important benefits of score models are their easy updating and customisation to local institutions.In fact, because the standardisation of local practices is difficult and patient populations may differ, it is now accepted that predictive models must be locally validated, tuned and periodically updated to provide correct risk-adjusted outcomes.All models suffer from the limitation of foreseeing better future treatments and improving prognosis (den Boer et al., 2005).Even very accurate predictive models, when exported to clinical contexts different from those in which they were designed, have often proved unreliable (Murphy-Filkins et al., 1996).Appropriate design and local customisation of excessively sophisticated models is often easier said than done, especially in health centres where there is little technical expertise in developing models that can generalise, i.e. preserve their predictive performance on future data.On the contrary, simple score models can easily and frequently be updated to learn from new correctly-classified cases and are quite tolerant to missing data.This is very useful in clinical practice where data is usually scarce and training on as much available data as possible is of fundamental importance (Cevenini & P. Barbini, 2010, as cited in P. Barbini et al., 2007).
A major problem with score models is that they are difficult to calibrate, i.e. associate reliable estimates of prognostic risk probability with each score.Nevertheless, correct estimation of individual probability of adverse outcome for hospitalized critical patients is useful for prevention, treatment and quantification of health problems and costs.It can help experienced physicians to improve clinical management by optimizing the monitoring of patient status and enhancing the quality of care, and allow new generations of doctors to be better trained during postgraduate specialization and internship.Moreover, reliable knowledge of risk factors and their impact on clinical course and future quality of life can encourage public health policy for risk reduction (Hodgman, 2008).
The proposed method offers a simple risk-assessment system that associates a reliable estimate of the individual probability of developing an adverse event with predicted scores.The model is a very simple score of risk factors chosen, one or more times, by a stepwise procedure based on maximising discrimination through ROC analysis.No hypotheses or statistical models are involved.Since conventional methods for evaluating calibration, such as the Hosmer-Lemeshow test (Hosmer & Lemeshow, 2000), are u n r e l i a b l e f o r s c o r i n g s y s t e m s , w e a n a l y s e d t h e 9 5 % c o n f i d e n c e i n t e r v a l o f s a m p l eestimated risk probabilities associated with each score step by step.The experimental score probability is easily evaluated by calculating the sampling rate of adverse outcomes having that score.
Unfortunately, the statistics of the sampling error are not simple to derive.We therefore preferred to use bootstrap resampling, a method commonly used in statistical inference to estimate confidence intervals (Carpenter & Bithell, 2000;DiCiccio & Efron, 1996).The bootstrap method is simpler and more general than conventional approaches; it requires no great expertise in mathematics or probability theory and is based on assumptions that are less restrictive and easier to control.The method can be used to evaluate statistics that are difficult or impossible to determine by conventional methods.We used an elaboration of the simplest bootstrap method of percentile intervals, known as bias-corrected and accelerated intervals, which avoids estimate bias and offers substantial advantages over other bootstrap methods, both in theory and practice (Chernick, 2007).Our simulation experiments confirmed the method's accuracy in estimating 95% CI of prognostic probabilities: when true probabilities were related to score values, or classes, with a sufficient number of sampled training data, they always fell within bootstrap-estimated 95% CIs (see Fig. 3).Bootstrap techniques are not too complex in a clinical environment, since nowadays many available packages for data processing include them for calculating confidence intervals.In any case, they are used exclusively during model design.
As shown in Fig. 3, step by step graphical inspection of probability CIs made it possible to choose the best model to compromise between calibration and discrimination, also suggesting convenient pooling of adjacent scores that gave large and overlapping CIs due to an insufficient number of cases or adverse events.The controlled simulation experiments showed that good calibration was achieved with a limited number of score classes, up to a maximum of seven in experiments with the biggest sample size, and high prevalence and separation between event classes (see Table 1).More classes could be identified if greater overlap of close scores were allowed, but when the number of classes became excessive, there were problems of overfitting.We also saw that a logistic model designed on the same training data provided nearly continuous probability estimates, the uncertainty of which was similar to that achieved by the score model.Significant improvement of discrimination performance could only be appreciated when continuous variables were also included in the logistic model, as in the clinical example described.This analysis can enable medical staff to select the best scoring system for any specific clinical context.

Conclusion
In critical care medicine, scoring systems are often designed exclusively on the basis of discrimination and generalisation characteristics (diagnostic capacity), at the expense of reliable individual probabilities (prognostic capacity).Our proposed approach that weighs both these capacities is validated by suitable simulation experiments, which also allow design conditions and application limits of scoring systems to be investigated for correct prediction of critical patient risk in a real clinical context.The bias-corrected and accelerated bootstrap method for evaluating the 95% confidence interval, CI, of individual prognostic probabilities provides reliable estimates of true simulated probabilities.CIs are calculated for each score and at each step of scoring-system design.By increasing the number of steps, model discrimination power (greater AUC) and prognostic information (greater number of different score values) increases but widening and overlap of 95% CIs soon occurs, so that it becomes convenient to pool adjacent scores into score classes.The maximum number of different score classes giving distinct prognostic information, that is having narrow and less overlapping 95% CIs, increases with increasing sample size and prevalence of adverse outcome and decreasing error probability of classification.It is strongly limited by reduced frequency of score cases and the respective rate of adverse events: in our simulated experiments, which covered a wide range of real conditions, it varied from 2 to 7. Application of the method to a real clinical situation demonstrated that the technique can be a simple practical tool, providing useful additional prognostic information to associate with classes of scores, and enabling doctors to choose the best risk score model to use in their specific clinical context.

Acknowledgment
This work was partly financed by the University of Siena, Italy.

Fig. 2 .
Fig. 2. Area under the ROC curve (AUC) during stepwise selection of model features from simulated data.The predictor variables entered are also indicated

Fig. 4 .
Fig. 4. 95% confidence intervals of AHE score probabilities estimated from simulated training data, percentages of score cases and true probabilities (stars) for the model identified at step no.8

Fig. 6 .Fig. 8 .
Fig. 6.Area under the ROC curve (AUC) during the stepwise selection of model features from clinical data.The predictor variables entered are also indicated

Table 2 .
The pooling of the highest scores of the eight-step model led to a Clinical variables, cut-off values for the dichotomisation of continuous variables and score-model entry steps.NE = not entered; NS = not statistically significant; LR = variable selected by stepwise logistic regression Two logistic regression models were designed to compare the score model results on the same training data with the 15 clinical variables of Table 2.The first model, named