Methods for Improving the Predictive Accuracy Using Multiple Linear Regression Analysis to Predict the Improvement Degree of Functional Independence Measure for Stroke Patients

Multiple linear regression analysis is frequently used in studies investigating the degree of Functional Independence Measure (FIM) improvement in stroke patients. However, the coefficient of determination R² is only about 0.46 to 0.73, meaning that the prediction accuracy is not necessarily high. The following methods are used to improve the prediction accuracy: using appropriate explanatory variables; using FIM effectiveness, which corrects for the ceiling effect, as the objective variable; creating multiple prediction formulas; converting numerical explanatory variables into dummy variables; and adding FIM improvement during the first month to the explanatory variables. Even so, it is difficult to predict the outcome of patients whose FIM gain is extremely large or small. It is desirable to combine these methods or to develop new methods to achieve accurate prediction.


Introduction
Multiple regression analysis is used to predict an objective variable from multiple explanatory variables. It is also used to determine how much influence each factor (explanatory variable) has on the outcome (objective variable). There are many reports in which multiple regression analysis was used to predict the degree of improvement in the Functional Independence Measure (FIM) [1], which is an index of activities of daily living (ADL). However, the prediction accuracy of multiple regression analysis is not necessarily high. The coefficient of determination R², which indicates how much of the variation in the objective variable the explanatory variables can explain, was 0.46 to 0.73 in the review by Heinemann et al. [2]. Although FIM improvement is predictable as "a tendency as a group", the accuracy does not reach the level of "individual case prediction" [3]. Therefore, methods to improve the prediction accuracy are required.
In this paper, we describe the problems that arise when multiple regression analysis is used to predict the degree of FIM improvement in patients with stroke, together with their countermeasures.

Using appropriate explanatory variables
The most common way to increase the prediction accuracy of multiple regression analysis is to use appropriate explanatory variables. In a review by Meyer et al. [4] of reports that used multiple regression analysis in acute stroke patients, 126 factors were used as explanatory variables, 63 of which were significant. The factors that were used in more than five prediction formulas and were significant in at least half of them were FIM at admission (significant in 46 of 51 formulas, 46/51), age (30/45), previous stroke (5/10), Barthel Index at admission (6/6), neglect (4/6), dysphasia (4/6), impulsivity (4/6), and the National Institutes of Health Stroke Scale (5/5) [4]. However, the number of significant explanatory variables incorporated into one prediction formula was only 4.1 on average [4]. Furthermore, the Japanese Guidelines for the Management of Stroke 2015 [5] state that "even if the variables used for prediction are simply increased, the prediction accuracy does not necessarily rise [6,7], and the advantages of using the simplest prediction method are also shown [8]". On the other hand, it has also been reported that adding comorbidities to the explanatory variables raised the coefficient of determination R² of multiple regression analysis with FIM at discharge as the objective variable from 0.732 to 0.798 [9], and that adding the Stroke Impairment Assessment Set (SIAS) raised R² from 0.61 to 0.64 [10]. Ideally, all factors that have a large influence on the objective variable should be incorporated into the explanatory variables, but such a set of explanatory variables has not yet been established. If R² does not rise sufficiently when the number of explanatory variables is merely increased, other methods should be considered.

What should we use as the objective variable?
The coefficient of determination R² depends on the objective variable. In the review by Meyer et al. [4], the mean R² was 0.65 (range 0.35 to 0.82) when the objective variable was FIM at discharge, 0.22 (0.08 to 0.40) when it was FIM gain (FIM at discharge minus FIM at admission), and 0.08 (0.03 to 0.14) when it was FIM efficiency (FIM gain divided by the number of days in hospital). For accurate prediction, R² needs to be 0.5 or more, and desirably 0.7 or more. Because R² is largest when FIM at discharge is used as the objective variable, many reports have used FIM at discharge as the objective variable. Specifically, in the review by Meyer et al. [4], the objective variable was FIM at discharge in 33 formulas, FIM gain in 20 formulas, and FIM efficiency in 3 formulas. But is choosing FIM at discharge as the objective variable really the best approach? To answer this question, it is necessary to understand the prediction formula, as well as the relationship between FIM at admission and each candidate objective variable: FIM gain, FIM at discharge, and FIM effectiveness.
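The two improvement indices defined above can be written out as a minimal sketch. The function names and the patient values are hypothetical and serve only to illustrate the definitions given in the text.

```python
def fim_gain(fim_admission, fim_discharge):
    """FIM gain: FIM at discharge minus FIM at admission."""
    return fim_discharge - fim_admission


def fim_efficiency(fim_admission, fim_discharge, days_in_hospital):
    """FIM efficiency: FIM gain divided by the number of days in hospital."""
    if days_in_hospital <= 0:
        raise ValueError("days_in_hospital must be positive")
    return fim_gain(fim_admission, fim_discharge) / days_in_hospital


# Hypothetical patient: FIM 60 at admission, 90 at discharge, 60 days in hospital.
example_gain = fim_gain(60, 90)
example_efficiency = fim_efficiency(60, 90, 60)
print(example_gain, example_efficiency)
```

Because FIM efficiency divides by the length of stay, it also depends on discharge timing, which is one possible reason why it is harder to predict than FIM gain; this remark is an interpretation and is not taken from the cited review.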

Prediction formula
Multiple linear regression analysis produces a prediction formula of the form Y = aX1 + bX2 + cX3 + d, where X1 to X3 are explanatory variables, a to c are regression coefficients, d is a constant term, and Y is the objective variable. This form assumes that a linear relationship exists between the explanatory and objective variables. For example, if Y is FIM gain and X1 is the patient's age, then a regression coefficient of a = -0.3 can be interpreted as signifying that "FIM gain decreases by 0.3 points for every additional year of age." However, this interpretation is subject to three potential problems.
First, the relationship between the explanatory and objective variables is not necessarily linear. The relationship between age and FIM gain, for example, is nonlinear: FIM gain is roughly constant up to the age of 69 and decreases linearly from the age of 70 [11]. Thus it cannot be said that FIM gain decreases uniformly at a rate of 0.3 points per added year of age in all age brackets. The relationship between FIM at admission and FIM gain is not linear, either [12]. The countermeasures against this problem (creating multiple prediction formulas and converting explanatory variables into dummy variables) are described later.
The second problem is that the partial regression coefficient for the age variable differs depending on which other explanatory variables are used. Indeed, the partial regression coefficients for the age variable reported by 5 separate studies were -0.11, -0.18, -0.20, -0.267, and -0.34 [13]. To solve this problem, it is necessary to incorporate sufficient explanatory variables into the prediction formula.
The third problem is that the influence exerted by patient age on FIM gain varies depending on the FIM score at admission. In a study in which patients were stratified into 6 sectors by FIM score at admission and 4 sectors by age, yielding a total of 24 patient groups (Figure 1), the influence of patient age on mean FIM gain differed between patients with FIM at admission in the range of 36-53 points and those with scores in the range of 90-107 points [14]. Thus one cannot say that FIM gain declines by 0.3 points for each additional year of age irrespective of FIM score at admission. One solution to this problem is to use FIM effectiveness, a FIM improvement index that is not easily affected by FIM score at admission, as the objective variable. Another approach is to combine factors through multiplicative coefficients. For example, one might multiply the standard FIM gain by an age-dependent influence coefficient (0.6 or a similar value), and then by a similar influence coefficient that depends on cognitive function. This approach may be preferable to multiple regression analysis, which adds the influences of the factors [15].
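As an illustration of the additive prediction formula Y = aX1 + bX2 + cX3 + d introduced at the start of this section, the following minimal sketch fits such a formula by ordinary least squares and computes R². It uses only the Python standard library. The helper names (fit_ols, predict, r_squared) and all patient values are hypothetical; in practice a statistics package would normally be used.

```python
def fit_ols(rows, y):
    """Ordinary least squares with a constant term, via the normal equations.

    rows: one tuple of explanatory variables per patient.
    Returns [d, a, b, ...], where d is the constant term.
    """
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coef, row):
    """Apply the fitted formula to one patient."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical patients: (age, mFIM at admission, cognitive FIM at admission)
rows = [(55, 40, 30), (62, 22, 20), (68, 55, 28), (71, 30, 25),
        (74, 18, 12), (79, 45, 22), (83, 25, 15), (88, 35, 18)]
gain = [35, 38, 25, 34, 20, 24, 18, 15]  # hypothetical measured FIM gain

coef = fit_ols(rows, gain)               # [d, a, b, c]
y_hat = [predict(coef, r) for r in rows]
r2 = r_squared(gain, y_hat)
print("d, a, b, c =", [round(c, 3) for c in coef])
print("R^2 =", round(r2, 3))
```

The second element of coef corresponds to the coefficient a for age. Refitting with a different set of explanatory variables generally changes its value, which is the second problem described above.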

Relationship between FIM at admission and FIM gain
The variable most closely related to FIM at discharge is FIM at admission. This is evident from the fact that, among the many explanatory variables, FIM at admission has the largest standardized partial regression coefficient (beta), which indicates the relative strength of the influence of an explanatory variable on the objective variable. Therefore, the relationship between FIM at admission and the objective variable, whether FIM gain, FIM at discharge, or FIM effectiveness, is very important.
The mean FIM gain is greatest for patients requiring moderate assistance. In contrast, patients with low FIM scores at admission show little improvement, and those with high FIM scores at admission are subject to a ceiling effect, so both groups display little gain in FIM (Figure 2a) [16]. In this figure, motor FIM (mFIM) at admission is on the horizontal axis and mFIM gain is on the vertical axis. The mean mFIM gain forms a mountain shape that peaks at around 25-30 points of mFIM at admission. Multiple regression analysis assumes a linear relationship between the explanatory variables and the objective variable, but it is difficult to approximate the relationship between mFIM at admission and mFIM gain with a straight line. One solution is to create multiple prediction formulas instead of one. In one study that sought to predict mFIM gain, stratifying patients into 2 groups by mFIM at admission and creating 2 prediction formulas yielded a stronger correlation between the predicted and measured values of mFIM gain (the correlation coefficient increased from 0.507 to 0.641) [12]. It is difficult to create a single prediction formula that applies to all patients, whereas creating multiple formulas for different patient sectors can improve the prediction accuracy.
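The effect of creating two prediction formulas can be sketched on synthetic data. Everything below is an assumption made for illustration: the mountain-shaped data are generated artificially, the threshold of 30 points is hypothetical (the text reports only that the peak lies at around 25-30 points), and a single explanatory variable is used for brevity. The sketch does not reproduce the cited study.

```python
def fit_line(x, y):
    """Simple least-squares line: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(u, v):
    """Pearson correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


THRESHOLD = 30  # hypothetical cut-off between the two patient groups


def synthetic_gain(x):
    """Artificial mountain-shaped mFIM gain with small deterministic scatter."""
    base = 2.0 * (x - 13) if x <= THRESHOLD else 34.0 * (91 - x) / 61.0
    scatter = 0.5 * (((x * 7) % 5) - 2)  # values between -1 and 1
    return base + scatter


admission = list(range(13, 91))          # synthetic mFIM at admission
gain = [synthetic_gain(x) for x in admission]

# One formula for all patients
slope, intercept = fit_line(admission, gain)
pred_single = [slope * x + intercept for x in admission]

# Two formulas, one per group
low = [(x, g) for x, g in zip(admission, gain) if x <= THRESHOLD]
high = [(x, g) for x, g in zip(admission, gain) if x > THRESHOLD]
line_low = fit_line([x for x, _ in low], [g for _, g in low])
line_high = fit_line([x for x, _ in high], [g for _, g in high])


def predict_stratified(x):
    s, i = line_low if x <= THRESHOLD else line_high
    return s * x + i


pred_stratified = [predict_stratified(x) for x in admission]

r_single = pearson(pred_single, gain)
r_stratified = pearson(pred_stratified, gain)
print("single formula:", round(r_single, 3))
print("two formulas:  ", round(r_stratified, 3))
```

On real data the improvement would be smaller than on this idealized example, and the choice of threshold would need to be justified from the observed relationship.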
Another solution is to convert numerical explanatory variables into dummy variables. For example, including the numerical value of the body mass index (BMI) among the explanatory variables can only lead to results of the form "FIM gain increases, or decreases, with increasing BMI." In contrast, converting the numerical value of BMI into dummy variables makes it possible to identify which BMI sector (underweight, normal weight, overweight, or obese) yields the largest FIM gain [17].
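The conversion of BMI into dummy variables can be sketched as follows. The cut-offs used here are the World Health Organization adult categories (18.5, 25, and 30 kg/m²) and are an assumption; the cited study may have used different boundaries. The function names and the choice of normal weight as the reference category are likewise hypothetical.

```python
CATEGORIES = ("underweight", "normal", "overweight", "obese")


def bmi_category(bmi):
    """Classify BMI using WHO adult cut-offs (an assumption, see lead-in)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


def bmi_dummies(bmi, reference="normal"):
    """Return one 0/1 dummy variable per category other than the reference."""
    category = bmi_category(bmi)
    return {name: int(name == category)
            for name in CATEGORIES if name != reference}


print(bmi_dummies(17.0))
print(bmi_dummies(22.0))
print(bmi_dummies(31.5))
```

Each dummy variable then receives its own regression coefficient, interpreted as the difference in predicted FIM gain from the reference category, so the relationship between BMI and FIM gain no longer has to be monotonic.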

Relationship between FIM at admission and FIM at discharge
Although FIM at discharge is not a FIM improvement index, multiple regression analysis with FIM at admission as one of the explanatory variables and FIM at discharge as the objective variable is often used in studies of FIM improvement. When mFIM at admission is plotted on the horizontal axis and mFIM at discharge on the vertical axis, the relationship is convex upward (Figure 2b). Even so, it can be approximated by a single straight line, and the coefficient of determination R² becomes large. We therefore examined whether the prediction accuracy of mFIM gain improves with the following method. Since "mFIM at discharge = mFIM at admission + mFIM gain", we first predicted mFIM at discharge using multiple regression analysis (for which R² is high), and then obtained the predicted value of mFIM gain by subtracting mFIM at admission from the predicted value of mFIM at discharge. As a result, the correlation coefficient between the predicted and measured values of mFIM gain was exactly the same as when mFIM gain was predicted directly using multiple regression analysis (for which R² is low) [18]. If FIM gain were 0 points in all patients, FIM at discharge would equal FIM at admission, and the correlation between FIM at admission and FIM at discharge would be 1.0. In other words, a correlation between FIM at admission and FIM at discharge is to be expected, and a high R² in the prediction of FIM at discharge does not mean that FIM improvement is predicted accurately.
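The reason why the two routes give exactly the same predicted mFIM gain can be shown with a short derivation. It assumes ordinary least squares, a design matrix X of full column rank, and that mFIM at admission is one of the columns of X; the symbols below are introduced here for illustration and do not appear in the cited report.

```latex
% X: design matrix (constant term and explanatory variables), full column rank.
% x_adm: mFIM at admission, the j-th column of X, so x_adm = X e_j.
% y_gain, y_dis: measured mFIM gain and mFIM at discharge.
\begin{align*}
y_{\mathrm{dis}} &= y_{\mathrm{gain}} + x_{\mathrm{adm}}, \qquad x_{\mathrm{adm}} = X e_j \\
\hat{\beta}_{\mathrm{dis}} &= (X^{\top}X)^{-1}X^{\top}y_{\mathrm{dis}}
  = (X^{\top}X)^{-1}X^{\top}y_{\mathrm{gain}} + (X^{\top}X)^{-1}X^{\top}X e_j
  = \hat{\beta}_{\mathrm{gain}} + e_j \\
\hat{y}_{\mathrm{dis}} &= X\hat{\beta}_{\mathrm{dis}}
  = \hat{y}_{\mathrm{gain}} + x_{\mathrm{adm}}
  \quad\Longrightarrow\quad
  \hat{y}_{\mathrm{dis}} - x_{\mathrm{adm}} = \hat{y}_{\mathrm{gain}} \\
R^2_{\mathrm{dis}} &= 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}(y_{\mathrm{dis}})}, \qquad
R^2_{\mathrm{gain}} = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}(y_{\mathrm{gain}})}
\end{align*}
```

Under these assumptions the residuals of the two models are identical, so the residual sum of squares is the same in both. The two R² values differ only through the total variation of the objective variable, and R² is higher for mFIM at discharge whenever its total variation is larger than that of mFIM gain.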

Relationship between FIM at admission and FIM effectiveness
FIM effectiveness has the merit of not being subject to the ceiling effect (Figure 2c). FIM effectiveness is defined as FIM gain/(126 points - FIM score at admission), and mFIM effectiveness is defined as mFIM gain/(91 points - mFIM score at admission) [19]. The denominator is the maximum possible improvement in FIM score for a given patient, and the numerator is the actual improvement in FIM score; thus FIM effectiveness is a value between 0 and 1 that expresses the improvement actually achieved as a fraction of the maximum possible improvement [19,20]. Among "studies investigating the influence of various factors on ADL improvement," FIM effectiveness has been used more frequently than FIM gain or FIM efficiency (63 studies using ADL effectiveness, compared with 7 studies using ADL gain and 16 studies using ADL efficiency) [19]. As described above, multiple regression analysis has the problem that the influence of a factor such as age on FIM gain differs depending on FIM at admission. Using FIM effectiveness as the objective variable may be an effective solution to this problem, because FIM effectiveness is not subject to the ceiling effect.
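The definition of FIM effectiveness can be illustrated with two hypothetical patients whose mFIM gains differ widely but whose mFIM effectiveness is the same. The function name and the patient values are assumptions made for this example.

```python
def fim_effectiveness(admission, discharge, max_score=126):
    """FIM gain divided by the maximum possible gain.

    Use max_score=126 for total FIM and max_score=91 for motor FIM (mFIM).
    """
    if admission >= max_score:
        raise ValueError("no room for improvement: admission score is at the maximum")
    return (discharge - admission) / (max_score - admission)


# Hypothetical patient A: mFIM 31 -> 61 (gain 30 out of a possible 60)
# Hypothetical patient B: mFIM 81 -> 86 (gain 5 out of a possible 10)
eff_a = fim_effectiveness(31, 61, max_score=91)
eff_b = fim_effectiveness(81, 86, max_score=91)
print(eff_a, eff_b)
```

Patient B gains far fewer points than patient A only because little room for improvement remains. Dividing by the remaining room gives both patients the same value, which is how the index corrects for the ceiling effect.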
However, there have been few reports of multiple regression analysis that used FIM effectiveness as the objective variable [21-23]. We recently obtained the following results. First, multiple regression analysis was used to predict mFIM effectiveness. Since "mFIM effectiveness = mFIM gain/(91 points - mFIM at admission)", we obtained the predicted mFIM gain by calculating "mFIM gain = mFIM effectiveness × (91 points - mFIM at admission)". By adding mFIM at admission to this predicted mFIM gain, we then obtained the predicted mFIM at discharge. The mFIM at discharge predicted in this way was more accurate than that predicted directly by multiple regression analysis (the correlation coefficient between measured and predicted values of mFIM at discharge was 0.916 for the former and 0.878 for the latter) [24]. We believe that this new method is effective for the accurate prediction of FIM at discharge.
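The back-conversion from a predicted mFIM effectiveness to a predicted mFIM gain and mFIM at discharge follows directly from the formulas quoted above. In this minimal sketch the function names are hypothetical, and the predicted effectiveness is simply passed in as a number; how it is obtained (the regression step) is outside the sketch.

```python
MFIM_MAX = 91  # maximum motor FIM score


def gain_from_effectiveness(predicted_effectiveness, mfim_admission):
    """mFIM gain = mFIM effectiveness x (91 points - mFIM at admission)."""
    return predicted_effectiveness * (MFIM_MAX - mfim_admission)


def discharge_from_effectiveness(predicted_effectiveness, mfim_admission):
    """Predicted mFIM at discharge = mFIM at admission + predicted mFIM gain."""
    return mfim_admission + gain_from_effectiveness(
        predicted_effectiveness, mfim_admission)


# Hypothetical patient: mFIM 31 at admission, predicted effectiveness 0.5
predicted_gain = gain_from_effectiveness(0.5, 31)
predicted_discharge = discharge_from_effectiveness(0.5, 31)
print(predicted_gain, predicted_discharge)
```

One property of this route is that the predicted mFIM at discharge cannot exceed 91 points as long as the predicted effectiveness does not exceed 1, whereas a formula that predicts mFIM at discharge directly has no such built-in limit.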

Prediction of patients with extremely large or small FIM gain is difficult
When the measured value of mFIM gain is plotted on the horizontal axis and the "residual", obtained by subtracting the predicted value from the measured value of mFIM gain, on the vertical axis, the absolute value of the residual is large in patients with extremely small mFIM gain and in those with extremely large mFIM gain (Figure 3). The residuals of patients with extremely small mFIM gain are negative (measured values < predicted values), and those of patients with extremely large mFIM gain are positive (measured values > predicted values). This indicates that multiple regression analysis tends to predict a moderate FIM gain, so that patients with extremely large or small FIM gain cannot be predicted well. Incorporating appropriate explanatory variables would be expected to improve the prediction accuracy. However, even when comorbidities, which are considered very important for outcome prediction, were added, the coefficient of determination R² increased by only 0.066 (from 0.732 to 0.798) [9].
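The pattern of the residuals described above is consistent with a general property of least squares, which can be stated as a short derivation. It assumes ordinary least squares with a constant term, and the variances and covariances are those of the sample used to fit the model; the notation is introduced here for illustration.

```latex
% y: measured mFIM gain, \hat{y}: predicted mFIM gain, e = y - \hat{y}: residual.
\begin{align*}
\operatorname{Cov}(e, \hat{y}) &= 0 \\
\operatorname{Cov}(e, y) &= \operatorname{Cov}(e, \hat{y} + e) = \operatorname{Var}(e) \\
\operatorname{Corr}(e, y) &= \frac{\operatorname{Var}(e)}{\sqrt{\operatorname{Var}(e)\,\operatorname{Var}(y)}}
  = \sqrt{\frac{\operatorname{Var}(e)}{\operatorname{Var}(y)}}
  = \sqrt{1 - R^2}
\end{align*}
```

Whenever R² is below 1, the residual is therefore positively correlated with the measured value. For example, an R² of 0.5 corresponds to a correlation of about 0.71 between residual and measured mFIM gain. The lower the R², the more strongly the predictions are drawn toward a moderate gain.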

Adding "FIM improvement for one month" to the explanatory variables
It may be difficult or ineffective to create an accurate prediction formula by entering a number of factors available at the time of admission as explanatory variables. In that case, the prediction accuracy of FIM gain may be increased by incorporating into the prediction formula "the amount of FIM improvement during the first month after admission (FIM improvement for one month)", which reflects the combined result of many of these factors. Indeed, adding "FIM improvement for one month" to the explanatory variables increased the coefficient of determination R² for predicting mFIM gain from 0.364 to 0.711, and R² for predicting mFIM at discharge from 0.744 to 0.884 [25].
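The comparison of prediction formulas with and without "FIM improvement for one month" can be sketched as a comparison of two nested models. All patient values below are hypothetical and were chosen so that the one-month improvement is informative; the helper names are also assumptions, and the resulting R² values have no relation to those of the cited report.

```python
def fit_ols(rows, y):
    """Ordinary least squares with a constant term, via the normal equations."""
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fitted_r_squared(rows, y):
    """Fit the formula and return R^2 on the same data."""
    coef = fit_ols(rows, y)
    y_hat = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical patients
mfim_admission = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
age = [80, 65, 72, 58, 77, 69, 84, 61, 75, 70]
one_month_improvement = [5, 14, 10, 16, 6, 11, 3, 12, 5, 6]
mfim_gain = [12, 38, 27, 40, 15, 26, 8, 24, 10, 12]

rows_without = list(zip(mfim_admission, age))
rows_with = list(zip(mfim_admission, age, one_month_improvement))

r2_without = fitted_r_squared(rows_without, mfim_gain)
r2_with = fitted_r_squared(rows_with, mfim_gain)
print("R^2 without one-month improvement:", round(r2_without, 3))
print("R^2 with one-month improvement:   ", round(r2_with, 3))
```

R² computed on the data used for fitting never decreases when an explanatory variable is added, so it is the size of the increase that carries information, not its direction.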

Conclusions
When multiple linear regression analysis is used to predict the degree of FIM improvement in stroke patients, it is necessary to know its problems and their countermeasures. It is desirable to raise the prediction accuracy to the level of "individual case prediction" by combining the various countermeasures and developing new methods.