Using an Anchor to Improve Linear Predictions with Application to Predicting Disease Progression

Linear models are some of the most straightforward and commonly used modelling approaches. Consider modelling approximately monotonic response data arising from a time-related process. If one has knowledge of when the process began or ended, then one may be able to leverage additional assumed data to reduce prediction error. This assumed data, referred to as the anchor, is treated as an additional data-point generated at either the beginning or end of the process. The response value of the anchor is set to an intelligently selected value of the response (such as the upper bound, lower bound, or 99th percentile of the response, as appropriate). The anchor reduces the variance of prediction at the cost of a possible increase in prediction bias, resulting in a potentially reduced overall mean squared prediction error. This can be extremely effective when few individual data-points are available, allowing one to make linear predictions using as little as a single observed data-point. We develop the mathematics showing the conditions under which an anchor can improve predictions, and demonstrate using this approach to reduce prediction error when modelling the disease progression of patients with amyotrophic lateral sclerosis.


Introduction
Prediction has always been an important part of statistical modeling. With the advent of big data and the rise of machine learning, one may think that researchers have moved beyond prediction via simple linear models. This is not the case, however, especially in the field of medical research: a quick search of PubMed returns over 1000 publications which utilize linear (but not generalized linear) models from January 2016 to July 2017. This is because linear models are usually one of the first attempted approaches when analyzing new data, and they are sufficient surprisingly often. Linear models are simple to calculate, requiring tiny amounts of computing power compared to some of the more complex machine-learning algorithms (such as neural networks). Most importantly, linear models are very straightforward to interpret and explain, a direct contrast to the more sophisticated black-box methods that are dependent on large datasets. The ability to interpret and understand statistical models, or model intelligibility, is especially important in the field of healthcare (Caruana, Lou, Gehrke, Koch, Sturm & Elhadad 2015).
Yet linear models have their failings, especially when modelling a bounded response. Consider attempting to model the disease progression over time of a patient with amyotrophic lateral sclerosis (ALS), also known as Lou Gehrig's disease. This is measured by the instrument known as the ALS Functional Rating Scale Revised, or ALSFRS-R (Cedarbaum, Stambler, Malta, Fuller, Hilt, Thurmond & Nakanishi 1999). The ALSFRS-R is always an integer between 0 and 48, with 48 representing no spread of the disease and 0 being the theoretical maximal spread of the disease. The progression of the ALSFRS-R tends to be very linear (Armon, Graves, Moses, Forté, Sepulveda, Darby & Smith 2000, Magnus, Beck, Giess, Puls, Naumann & Toyka 2002), but because of its bounded nature, simple linear models have the inherent structural defect of creating predictions that violate these lower and upper bounds. Many adjustments to this problem exist: examples include truncating the prediction to 48 if the prediction is too large (0 if too small) (Amemiya 1973) or performing a logistic transform on the data (Lesaffre, Rizopoulos & Tsonaka 2007).

Revista Colombiana de Estadística 41 (2018) 137-155
If the goal is prediction, say of the patient's ALSFRS-R at one year, these adjustments may not perform well when only small amounts of observed data exist.
The small number of data-points can result in the variance of the prediction being very large, producing a large mean squared prediction error (MSPE). Recall that the MSPE is equivalent to the sum of the variance and the squared bias of the prediction.
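Written out in the notation used below, this is the standard bias-variance decomposition of the prediction error:

```latex
\mathrm{MSPE}\bigl(\hat{Y}_0\bigr)
  = E\Bigl[\bigl(y_0 - \hat{Y}_0\bigr)^2\Bigr]
  = \mathrm{Var}\bigl(y_0 - \hat{Y}_0\bigr)
  + \Bigl(E\bigl[\hat{Y}_0\bigr] - E\bigl[y_0\bigr]\Bigr)^2 .
```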
In this paper we consider a simple method to reduce the variability of linear predictions at the cost of potentially increasing the predictive bias. Biased linear regression itself is not new (ridge regression (Hoerl & Kennard 2000) is one well-known example), but we do this in a unique way: by exploiting our knowledge of when the process we are modelling (e.g. the patient's disease progression) first began.
Tracking the date when a patient first began noticing symptoms of ALS (their disease onset time) is common practice in ALS clinics and trials. From a modelling perspective, one could use this information in a variety of ways: the most obvious is as a covariate in the model. Let us try a different approach: if we were to assume an ALSFRS-R score for the patient at roughly the time of their disease onset, what might it be? One could argue that the patient has had minimal, if any, disease progression at the time of disease onset. It seems reasonable, then, to assume their ALSFRS-R to be 48 (meaning the minimum possible disease progression) at this time. We could then create a new observation with ALSFRS-R score of 48 at the time of disease onset, and include that as one of the observations (data-points) used to build our linear model.
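As a concrete sketch, the augmentation step can be written in a few lines of NumPy. The visit times, scores, and onset date below are illustrative values of our own, not taken from any dataset:

```python
import numpy as np

def fit_line(x, y):
    """OLS fit of y = a + b*x; np.polyfit returns [slope, intercept]."""
    b, a = np.polyfit(x, y, 1)
    return a, b

# Illustrative early ALSFRS-R measurements: (days from baseline, score).
x_obs = np.array([0.0, 30.0, 60.0])
y_obs = np.array([40.0, 39.0, 37.0])

# Anchor: assume a score of 48 (no progression) at disease onset,
# here taken to be 200 days before trial baseline.
x_aug = np.append(x_obs, -200.0)
y_aug = np.append(y_obs, 48.0)

a_std, b_std = fit_line(x_obs, y_obs)   # standard OLS model
a_anc, b_anc = fit_line(x_aug, y_aug)   # OLS model with the anchor

pred_std = a_std + b_std * 365          # one-year predictions
pred_anc = a_anc + b_anc * 365
```

With only three observed visits, the anchor pulls the fitted slope toward the long-run trend implied by the onset date, typically stabilizing the one-year extrapolation.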
In this paper we consider utilizing knowledge of when a process starts to create an assumed data-point, which can then be used to reduce the variability of linear model predictions. We found no previous literature on this technique in our literature search. We first show how the inclusion of this point mathematically reduces the variance component of the MSPE under the assumptions of ordinary least-squares (OLS) linear regression; then we calculate the bias component it brings to the MSPE; finally, we deduce the condition under which this approach reduces the overall MSPE (combined variance and squared bias) of prediction. Afterwards we give an example of utilizing this approach in the context of modeling ALS disease progression, showing how it improves the MSPE when compared to a linear model lacking the extra data-point. We also show how it is superior to a logit transform approach. We stress that this method is simple to understand, easy to perform, and inexpensive to implement. It is our hope that this idea may be utilized by pragmatic researchers to improve their linear predictions and estimations at very little additional cost.

The Effect of Using an Anchor on the Mean Square Prediction Error in Simple Linear Regression
Here we develop the theoretical results that justify the creation and use of an extra assumed data-point to improve modelling. We shall refer to this data-point as the anchor. Consider $n-1$ ordered pairs $(x_i, y_i)$, $i = 1, \dots, n-1$, where $y_i$ is some response corresponding to $x_i$. As per ordinary linear regression (Kutner, Nachtsheim & Neter 2004), assume that $x_i$ and $y_i$ have a linear relationship, meaning that for some constants $\beta_0$ and $\beta_1$, $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with independent error terms $\epsilon_i \sim N(0, \sigma^2)$. Furthermore, assume an additional observation referred to as the anchor given by $(x_n, y_n)$, where $y_n$ is some fixed constant in $\mathbb{R}$.
Consider the problem of predicting a new value $y_0$ corresponding to a given $x_0$, which is typically obtained by using the OLS estimates for $\beta_0$ and $\beta_1$, denoted as $a$ and $b$. Denote the resultant prediction for $y_0$ which utilizes the first $n-1$ coordinate pairs by $\hat{Y}_0^{(n-1)} = a^{(n-1)} + b^{(n-1)} x_0$, and the prediction which also includes the anchor by $\hat{Y}_0^{(n)} = a^{(n)} + b^{(n)} x_0$. Denote the errors between our predictions and the truth by $e_0^{(n-1)} = y_0 - \hat{Y}_0^{(n-1)}$ and $e_0^{(n)} = y_0 - \hat{Y}_0^{(n)}$. Recall that the variance of $e_0^{(n-1)}$ (which was built from $n-1$ ordered pairs of data in standard OLS regression) is equivalent to

$$\mathrm{Var}\left(e_0^{(n-1)}\right) = \sigma^2\left(1 + \frac{1}{n-1} + \frac{\left(x_0 - \bar{x}^{(n-1)}\right)^2}{SSX^{(n-1)}}\right),$$

where $\bar{x}^{(n-1)}$ and $SSX^{(n-1)} = \sum_{i=1}^{n-1}\left(x_i - \bar{x}^{(n-1)}\right)^2$ are the mean and sum of squares of the first $n-1$ $x$'s, and $\mathrm{Var}\left(e_0^{(n)}\right)$ represents the variance of the prediction error obtained from utilizing all $n$ data-points (meaning we include the anchor).
We first show that $\mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Var}\left(e_0^{(n-1)}\right)$, meaning any choice of anchor will decrease the variance component of the MSPE. We then derive an upper bound on the bias of the anchor such that the MSPE will decrease; in other words, how far away from the true line the anchor can be before it makes the MSPE worse.
Without loss of generality, we will assume the following for the observed data: $\bar{x}^{(n-1)} = 0$ and $SSX^{(n-1)} = \sum_{i=1}^{n-1} x_i^2 = 1$. Any collection of $(x_i, y_i)$ can be linearly transformed in the $x$-coordinate by subtracting from each $x_i$ the mean of the $x$'s and dividing by the Euclidean norm to achieve this normalization. Explicitly, each $x_j$ is transformed by applying

$$g(x_j) = \frac{x_j - \bar{x}^{(n-1)}}{\sqrt{\sum_{i=1}^{n-1}\left(x_i - \bar{x}^{(n-1)}\right)^2}}.$$

It is interesting to point out that this transformation has no impact on the OLS estimator for $\sigma^2$.
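This normalization, and the invariance of the residuals (and hence of the estimate of $\sigma^2$) under it, can be checked directly in a few lines (the data values are arbitrary illustrations):

```python
import numpy as np

def normalize_x(x):
    """Apply g: center the x's, then divide by the Euclidean norm of the
    centered values, so that mean(z) = 0 and sum(z**2) = 1."""
    centered = x - x.mean()
    return centered / np.linalg.norm(centered)

x = np.array([1.0, 4.0, 7.0, 10.0])
y = np.array([2.0, 3.5, 5.0, 7.0])
z = normalize_x(x)

# g is linear in x, so the OLS fitted values, and therefore the residuals
# and the estimate of sigma^2, are unchanged by the transformation.
res_x = y - np.polyval(np.polyfit(x, y, 1), x)
res_z = y - np.polyval(np.polyfit(z, y, 1), z)
```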

Utilizing an Anchor Reduces Predictive Variability
Here we show that inclusion of an anchor in an OLS regression will always reduce the variance of newly predicted responses. Intuitively, it makes sense that the variance of the slope and intercept estimates will shrink as more points are included in the OLS regression: consider a simulation where one draws two observations and obtains the OLS estimates for the slope and intercept, as compared to a simulation where one draws three observations. The latter will have less variance in the OLS estimates, resulting in less variance in newly predicted responses.
The variance is then reduced even further when one assumes that the additional observation is an anchor and has a variance of zero.
Theorem 1. For any anchor point $(x_n, y_n)$, with $y_n$ a fixed constant, $\mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Var}\left(e_0^{(n-1)}\right)$.

Proof. Since $y_0$ is independent of the fitted model, $\mathrm{Var}\left(e_0^{(n)}\right) = \mathrm{Var}(y_0) + \mathrm{Var}(a + bx_0)$, so the claim $\mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Var}\left(e_0^{(n-1)}\right)$ is equivalent to

$$\mathrm{Var}(y_0) + \mathrm{Var}(a + bx_0) \le \sigma^2\left(1 + \frac{1}{n-1} + \frac{\left(x_0 - \bar{x}^{(n-1)}\right)^2}{SSX^{(n-1)}}\right).$$

This can be simplified using our assumptions on $\bar{x}^{(n-1)}$ and $SSX^{(n-1)}$ to obtain

$$\mathrm{Var}(a + bx_0) \le \sigma^2\left(\frac{1}{n-1} + x_0^2\right),$$

where the left-hand side simplifies as follows using properties of variance:

$$\mathrm{Var}(a + bx_0) = \mathrm{Var}(a) + x_0^2\,\mathrm{Var}(b) + 2x_0\,\mathrm{Cov}(a, b).$$

We next consider the individual terms $\mathrm{Var}(a)$, $\mathrm{Var}(b)$, and $\mathrm{Cov}(a, b)$. For convenience $SSX$ denotes $SSX^{(n)}$ and $\bar{x}$ denotes $\bar{x}^{(n)}$.
Part 1: variance of the slope. Since $y_n$ is constant, the $n$th term of the summation in $b = \sum_{i=1}^{n}(x_i - \bar{x})y_i / SSX$ contributes no variance, and we can write $\mathrm{Var}(b)$ as:

$$\mathrm{Var}(b) = \frac{\sigma^2 \sum_{i=1}^{n-1}\left(x_i - \bar{x}\right)^2}{SSX^2}.$$

Utilizing the assumption that $\sum_{i=1}^{n-1} x_i^2 = 1$ (so that $\bar{x} = x_n/n$ and $SSX = \left(n + (n-1)x_n^2\right)/n$), and multiplying numerator and denominator by $n^2$, this is equivalently

$$\mathrm{Var}(b) = \frac{\sigma^2\left(n^2 + (n-1)x_n^2\right)}{\left(n + (n-1)x_n^2\right)^2}.$$

Part 2: variance of the intercept. Since $\mathrm{Var}(y_n) = 0$ and $\mathrm{Var}(y_i) = \sigma^2$ for $i \in 1, \dots, n-1$, we use properties of the variance on $a = \bar{y} - b\bar{x}$ to find:

$$\mathrm{Var}(a) = \mathrm{Var}(\bar{y}) + \bar{x}^2\,\mathrm{Var}(b) - 2\bar{x}\,\mathrm{Cov}(\bar{y}, b).$$

Distributing the summation within each term and multiplying as needed to get a common denominator, this is equivalent to

$$\mathrm{Var}(a) = \frac{\sigma^2\left((n-1) + (2n-1)x_n^2 + (n-1)x_n^4\right)}{\left(n + (n-1)x_n^2\right)^2}.$$

Part 3: covariance of the intercept and slope. Consider $\mathrm{Cov}(a, b)$. We use the property that $\mathrm{Cov}\left(\sum c_i y_i, \sum d_i y_i\right) = \sigma^2 \sum c_i d_i$ and the fact that any covariance or variance term involving $y_n$ is $0$, since $y_n$ is a constant.
Or equivalently (after multiplying as needed to get a common denominator):

$$\mathrm{Cov}(a, b) = -\frac{\sigma^2 x_n\left((2n-1) + (n-1)x_n^2\right)}{\left(n + (n-1)x_n^2\right)^2}.$$

Part 4: proving the inequality $\mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Var}\left(e_0^{(n-1)}\right)$. Recall that this inequality is equivalent to the following:

$$\mathrm{Var}(a) + x_0^2\,\mathrm{Var}(b) + 2x_0\,\mathrm{Cov}(a, b) \le \sigma^2\left(\frac{1}{n-1} + x_0^2\right).$$

Substituting the previously derived terms on the left-hand side results in a statement which is trivially true if $\sigma^2 = 0$. Otherwise, this statement is equivalent to requiring $g(x_0) \ge 0$, where $g$ is quadratic in $x_0$ with form $g(x_0) = Ax_0^2 + Bx_0 + C$. Writing $D = n + (n-1)x_n^2$ and $K = (2n-1) + (n-1)x_n^2$, the coefficients $A$, $B$, $C$ simplify to single terms in the following way:

$$A = \frac{\sigma^2 (n-1)\, x_n^2\, K}{D^2}, \qquad B = \frac{2\sigma^2 x_n K}{D^2}, \qquad C = \frac{\sigma^2 K}{(n-1)\, D^2}.$$

Since $A > 0$ for $n > 2$ and $x_n \ne 0$ (when $x_n = 0$, the statement reduces to $g(x_0) = C > 0$), $g(x_0)$ is an upward-facing parabola. Also, the discriminant, given by $B^2 - 4AC$, is equal to zero:

$$B^2 - 4AC = \frac{4\sigma^4 x_n^2 K^2}{D^4} - 4 \cdot \frac{\sigma^2 (n-1) x_n^2 K}{D^2} \cdot \frac{\sigma^2 K}{(n-1) D^2} = 0,$$

meaning there is exactly one root of $g(x_0)$. Therefore, it must be true that $g(x_0) \ge 0$, and hence $\mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Var}\left(e_0^{(n-1)}\right)$. $\square$

Thus we see that any choice of anchor will necessarily result in a reduction of the variance of the prediction of $y_0$, which is equivalent to a reduction of the variance component of the MSPE. However, we still need to consider the bias.
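Theorem 1 is easy to corroborate by simulation. The sketch below (parameter values are our own illustrations; the anchor is deliberately placed off the true line) repeatedly refits both models and compares the empirical variance of the resulting predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 48.0, -0.05, 2.0    # true line and noise level
x = np.array([0.0, 30.0, 60.0, 90.0])     # the n-1 observed x's
xn, yn = -200.0, 48.0                     # fixed anchor (off the true line)
x0 = 365.0                                # prediction point

def predict(xs, ys, x0):
    b, a = np.polyfit(xs, ys, 1)
    return a + b * x0

preds_std, preds_anc = [], []
for _ in range(10000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
    preds_std.append(predict(x, y, x0))
    preds_anc.append(predict(np.append(x, xn), np.append(y, yn), x0))

var_std = np.var(preds_std)
var_anc = np.var(preds_anc)
assert var_anc < var_std   # any fixed anchor reduces predictive variance
```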
Recall that the typical OLS estimators for slope and intercept are unbiased, but this is not necessarily true when including an anchor. We next consider how much bias an anchor will introduce to the estimators for the slope and intercept, and the effect this has on the MSPE (compared to the MSPE that does not include an anchor). It will be shown that any choice of anchor $(x_n, y_n)$ such that $y_n \ne \beta_0 + \beta_1 x_n$ will introduce bias to the model. Note that the bias is a direct function of $\beta_0$ and $\beta_1$, which are rarely known in practice. Again, let $\bar{x}$ denote $\bar{x}^{(n)}$ and $SSX$ denote $SSX^{(n)}$.

Predictive Bias Caused by Utilizing an Anchor
There is no such thing as a free lunch, and while using an anchor brings the benefit of predictive variance reduction, it can potentially inject bias into the predictions.
Here we quantify this bias in terms of the true regression slope (Theorem 2) and intercept (Theorem 3). These biases propagate into the prediction $\hat{Y}_0^{(n)}$ directly.

Theorem 2. Using anchor point $(x_n, y_n)$ results in biasing the slope by

$$E(b) - \beta_1 = \frac{(n-1)\, x_n \left(y_n - \beta_0 - \beta_1 x_n\right)}{n + (n-1)x_n^2}.$$

This result shows that an anchor will almost always bias the estimate for the slope; however, as a corollary we will show that no bias is added when the anchor lies directly on the true regression line.
Proof. Recall $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ and that the OLS estimate for $\beta_1$, denoted by $b$, is given by $b = \sum_{i=1}^{n}(x_i - \bar{x})y_i / SSX$. We first derive $E(b)$. Recall that $y_n$ is a nonrandom constant, and hence $E(y_n) = y_n$. Then we can partition the expectation of the summation:

$$E(b) = \frac{\sum_{i=1}^{n-1}\left(x_i - \bar{x}\right) E(y_i) + \left(x_n - \bar{x}\right) y_n}{SSX}.$$

Note that the OLS estimate for $\beta_0$ when not using the anchor point is given by $a^{(n-1)} = \bar{y}^{(n-1)} - b^{(n-1)}\bar{x}^{(n-1)} = \bar{y}^{(n-1)}$ under our normalization. Since these are the unbiased OLS estimators for $\beta_0$ and $\beta_1$ when ignoring the anchor point, it must be that $E\left(\bar{y}^{(n-1)}\right) = \beta_0$ and $E\left(b^{(n-1)}\right) = \beta_1$. Using these values, the linearity of expectation, and the facts $\bar{x} = x_n/n$ and $SSX = \left(n + (n-1)x_n^2\right)/n$, we then have

$$E(b) = \frac{n\beta_1 + (n-1)x_n\left(y_n - \beta_0\right)}{n + (n-1)x_n^2},$$

which means the bias of $b$ is given by

$$E(b) - \beta_1 = \frac{(n-1)\, x_n \left(y_n - \beta_0 - \beta_1 x_n\right)}{n + (n-1)x_n^2}. \qquad \square$$

We next quantify the bias added to the estimate of the intercept parameter when using an anchor.
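Theorem 2's formula can be verified numerically without simulation: $b$ is linear in the $y$'s, so with noiseless data the fitted slope equals its expectation, and the observed bias should match $(n-1)x_n(y_n - \beta_0 - \beta_1 x_n)/(n + (n-1)x_n^2)$ exactly. A sketch under the paper's normalization, with made-up parameter values:

```python
import numpy as np

beta0, beta1 = 40.0, -2.0
x = np.array([-3.0, -1.0, 1.0, 3.0])
x = x / np.linalg.norm(x)      # normalized design: mean 0, sum of squares 1
y = beta0 + beta1 * x          # noiseless responses, so b equals E(b)

xn, yn = 2.0, 30.0             # anchor placed off the true line
n = x.size + 1

b, a = np.polyfit(np.append(x, xn), np.append(y, yn), 1)
slope_bias = b - beta1
formula = (n - 1) * xn * (yn - beta0 - beta1 * xn) / (n + (n - 1) * xn**2)
assert np.isclose(slope_bias, formula)
```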
Theorem 3. Using anchor point $(x_n, y_n)$ results in biasing the intercept by

$$E(a) - \beta_0 = \frac{y_n - \beta_0 - \beta_1 x_n}{n + (n-1)x_n^2}.$$

Similarly to Theorem 2, this result shows that an anchor will almost always bias the estimate for the intercept. Again, this bias shrinks as the anchor moves closer to the true regression line, and no bias is added when the anchor lies directly on it.
Proof. Recall that the OLS estimate for $\beta_0$, denoted by $a$, is given by $a = \bar{y} - b\bar{x}$. We first calculate $E(a)$. Again, recall that $E\left(\bar{y}^{(n-1)}\right) = \beta_0$ and that $E(y_n) = y_n$, so $E(\bar{y}) = \left((n-1)\beta_0 + y_n\right)/n$. We derived $E(b)$ in Theorem 2. Thus:

$$E(a) = \frac{(n-1)\beta_0 + y_n}{n} - \frac{x_n}{n} \cdot \frac{n\beta_1 + (n-1)x_n\left(y_n - \beta_0\right)}{n + (n-1)x_n^2},$$

which can be reduced to

$$E(a) = \beta_0 + \frac{y_n - \beta_0 - \beta_1 x_n}{n + (n-1)x_n^2}.$$

Therefore the bias of the intercept is $\left(y_n - \beta_0 - \beta_1 x_n\right) / \left(n + (n-1)x_n^2\right)$. $\square$

With Theorems 2 and 3, we can use the linearity of expected values to determine the bias when predicting a new response $\hat{Y}_0^{(n)}$.
Corollary 1. The overall bias induced by using anchor point $(x_n, y_n)$ is given by

$$E\left(\hat{Y}_0^{(n)}\right) - E(y_0) = \left(E(a) - \beta_0\right) + x_0\left(E(b) - \beta_1\right),$$

which can be reduced algebraically to

$$E\left(\hat{Y}_0^{(n)}\right) - E(y_0) = \frac{\left(y_n - \beta_0 - \beta_1 x_n\right)\left(1 + (n-1)x_n x_0\right)}{n + (n-1)x_n^2}.$$

We next show when no bias is added when using an anchor. As mentioned previously, this typically happens only when the anchor lies directly on the true regression line; in other words, when the anchor $(x_n, y_n)$ is such that $y_n = \beta_0 + \beta_1 x_n$.

Corollary 2. When using an anchor to predict $y_0$ for any given $x_0 \ne \frac{-1}{(n-1)x_n}$, the overall bias is zero if and only if $y_n = \beta_0 + \beta_1 x_n$.

Proof. Recall the overall bias given in Corollary 1. Given that $x_0 \ne \frac{-1}{(n-1)x_n}$, the factor $1 + (n-1)x_n x_0$ is nonzero, so the overall bias is zero if and only if $y_n - \beta_0 - \beta_1 x_n = 0$, meaning the anchor point lies on the true regression line. $\square$
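The same noiseless-data check used for Theorem 2 applies to Corollary 1: the prediction bias should equal $(y_n - \beta_0 - \beta_1 x_n)(1 + (n-1)x_n x_0)/(n + (n-1)x_n^2)$. Again, the numbers below are arbitrary illustrations:

```python
import numpy as np

beta0, beta1 = 10.0, 3.0
x = np.array([-3.0, -1.0, 1.0, 3.0])
x = x / np.linalg.norm(x)      # normalized design: mean 0, sum of squares 1
y = beta0 + beta1 * x          # noiseless, so the fit equals its expectation

xn, yn, x0 = 2.0, 5.0, 1.5     # anchor off the line; prediction point x0
n = x.size + 1

b, a = np.polyfit(np.append(x, xn), np.append(y, yn), 1)
pred_bias = (a + b * x0) - (beta0 + beta1 * x0)

D = n + (n - 1) * xn**2
formula = (yn - beta0 - beta1 * xn) * (1 + (n - 1) * xn * x0) / D
assert np.isclose(pred_bias, formula)
```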

Using an Anchor to Reduce the Mean Square Prediction Error
We combine the previous theorems to deduce exactly when using an anchor will improve the MSPE of predicting a new response. If the variability is reduced by more than the square of the bias is increased, the MSPE will shrink, which is desired. In Theorem 4 we derive an exact bound for when this occurs.

Theorem 4. Utilizing anchor point $(x_n, y_n)$ reduces the overall MSPE when the following inequality holds:

$$\left(y_n - \beta_0 - \beta_1 x_n\right)^2 \le \frac{\sigma^2\left((2n-1) + (n-1)x_n^2\right)}{n-1}.$$

Note that this bound is a function of the true regression slope and intercept, $\beta_1$ and $\beta_0$. In practice, these are rarely known, which makes this bound difficult to use as a decision rule for including the use of an anchor (at least, outside of a simulation).

Proof. Consider the following inequality:

$$MSPE^{(n)} \le MSPE^{(n-1)}.$$

This is equivalent to

$$\mathrm{Bias}^2\left(e_0^{(n)}\right) + \mathrm{Var}\left(e_0^{(n)}\right) \le \mathrm{Bias}^2\left(e_0^{(n-1)}\right) + \mathrm{Var}\left(e_0^{(n-1)}\right).$$

Recall that $\mathrm{Bias}^2\left(e_0^{(n-1)}\right) = 0$, and that the variance reduction $\mathrm{Var}\left(e_0^{(n-1)}\right) - \mathrm{Var}\left(e_0^{(n)}\right) = g(x_0)$ was derived in Theorem 1. The remaining piece, $\mathrm{Bias}^2\left(e_0^{(n)}\right)$, follows from Theorems 2 and 3 via Corollary 1. Substituting these into the inequality and cancelling the common factor $\left(1 + (n-1)x_n x_0\right)^2$ results in the formula given in the statement of Theorem 4. $\square$
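As a numerical sanity check on this result, the simulation below (with illustrative parameter values, under the paper's normalization) compares the mean squared prediction error for an anchor whose squared offset from the true line lies inside versus outside the bound $\sigma^2\left((2n-1) + (n-1)x_n^2\right)/(n-1)$:

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 0.0, 1.0, 1.0
x = np.array([-3.0, -1.0, 1.0, 3.0])
x = x / np.linalg.norm(x)                 # normalized design
n, xn, x0 = x.size + 1, 1.0, 2.0
mu0 = beta0 + beta1 * x0                  # true mean response at x0

# Theorem 4's bound on the squared distance of the anchor from the true line.
bound = sigma**2 * ((2 * n - 1) + (n - 1) * xn**2) / (n - 1)

def mse(yn, sims=5000):
    """Mean squared error of the prediction at x0 (yn=None: no anchor)."""
    errs = np.empty(sims)
    for s in range(sims):
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
        if yn is None:
            b, a = np.polyfit(x, y, 1)
        else:
            b, a = np.polyfit(np.append(x, xn), np.append(y, yn), 1)
        errs[s] = (a + b * x0 - mu0) ** 2
    return errs.mean()

on_line = beta0 + beta1 * xn
mse_std = mse(None)
mse_inside = mse(on_line + 0.5)    # squared offset 0.25, inside the bound
mse_outside = mse(on_line + 5.0)   # squared offset 25.0, outside the bound
assert mse_inside < mse_std < mse_outside
```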
Thus we see that any choice of anchor point will reduce the variance of prediction, but will almost always increase the bias of the prediction, depending on how far the anchor point is from the true regression line. To see why, observe that the square of the total bias is quadratic in $y_n$, which must have exactly one root at the vertex. Given that $x_0 \ne \frac{-1}{(n-1)x_n}$, this root occurs only when $y_n = \beta_0 + \beta_1 x_n$. Because it is quadratic in $y_n$, the square of the bias will increase as $(x_n, y_n)$ moves further away from the true regression line. Therefore, using an anchor may or may not be beneficial, depending on how much bias is added.
The bound calculated in Theorem 4 could potentially be used as a decision rule for determining whether using an anchor is beneficial. Unfortunately, one needs to know the true values of $\beta_0$ and $\beta_1$ in order to use Theorem 4's result.
In practice, one tends not to know the true regression parameters (in which case there would be no need for an anchor at all), although with sufficiently informed prior knowledge, precise estimates may exist. Thus, when deciding whether to use an anchor, we suggest comparing the anchor model to a standard model by validating both in some way (perhaps via a cross-validation approach). We show an example of this in Section 3.
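In its simplest form, this comparison can use a single held-out visit per patient. A minimal sketch (the visit data, onset date, and `holdout_error` helper are our own illustration, not from the paper):

```python
import numpy as np

def holdout_error(x, y, x_anchor=None, y_anchor=None):
    """Fit on all visits but the last; return squared error on the held-out visit."""
    x_tr, y_tr = x[:-1], y[:-1]
    if x_anchor is not None:
        x_tr, y_tr = np.append(x_tr, x_anchor), np.append(y_tr, y_anchor)
    b, a = np.polyfit(x_tr, y_tr, 1)
    return (a + b * x[-1] - y[-1]) ** 2

# Hypothetical patient: visits (days, ALSFRS-R); onset 150 days before baseline.
x = np.array([0.0, 25.0, 55.0, 85.0])
y = np.array([42.0, 41.0, 39.0, 38.0])

err_std = holdout_error(x, y)
err_anc = holdout_error(x, y, x_anchor=-150.0, y_anchor=48.0)
use_anchor = err_anc < err_std   # simple data-driven decision rule
```

For this made-up patient the anchor model wins the holdout comparison, but the point of the rule is that either outcome is possible and the data decide.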
Before moving to the application section, we note that many of the ideas in this paper have Bayesian connections. For example, consider performing a Bayesian analysis of classical regression. When utilizing the standard noninformative prior distribution, the posterior mean estimates for the slope and intercept terms (and their standard errors) are equivalent to those obtained under frequentist OLS (Gelman 2014). It follows that Theorems 1-4 still hold under the Bayesian paradigm, meaning that an anchor can be utilized to reduce the variance of posterior predictions.

Application to ALS Prediction
We next consider using an anchor to improve linear models that pertain to predicting disease progression in patients with ALS. Note that the theory developed in Section 2 applies to a single OLS regression (a prediction for an individual). The following example expands on this by showing how utilizing an anchor can improve the average prediction error across several OLS regressions (a prediction for each of several individuals).
Our data comes from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database (Atassi, Berry, Shui, Zach, Sherman, Sinani, Walker, Katsovskiy, Schoenfeld, Cudkowicz & Leitner 2014).In 2011, Prize4Life, in collaboration with the Northeast ALS Consortium, and with funding from the ALS Therapy Alliance, formed the PRO-ACT Consortium.The data available in the PRO-ACT Database has been volunteered by PRO-ACT Consortium members (https://nctu.partners.org/PRO-ACT/).
Recall that ALS disease progression is tracked by the ALSFRS-R (see Section 1), our outcome variable, which is an integer between 0 and 48, where 48 represents the minimal amount of disease progression and 0 represents the maximal progression. For each patient, we model the ALSFRS-R versus time (in days). Specifically, time is measured in days from trial baseline, meaning x = 0 corresponds to the beginning of the trial and x = 365 corresponds to the 365th day after the trial began. On this scale, a patient's disease onset time is typically negative, as it happened before the trial began. We required patients to have the following: (1) at least two recorded ALSFRS-R scores before 3 months, for model-building purposes; (2) a non-missing value for time of disease onset; (3) at least one year between the baseline and last ALSFRS-R score, for MSPE-validation purposes. This resulted in 1606 patients, with an average ± standard error of 12 ± 4.54 time-points per patient (and 3 ± 0.96 visits in the first three months).
Note that we are now considering data on several distinct patients, each with their own ALSFRS-R trajectory. To demonstrate how utilizing the anchor-point improves OLS regression, we simply model each patient independently with (1) a standard OLS regression model and (2) an OLS regression model utilizing an anchor. Note that the ALSFRS-R follows a fairly linear decline, with wildly varying between-patient progression rates, justifying the use of linear models (Figure 1). The assumed data-point, or anchor, utilized in the anchor model comes from assuming minimal disease progression at the time of disease onset. In other words, each patient's data is augmented with the additional data-point given by the ordered pair (x_onset, 48), since 48 is the ALSFRS-R corresponding to minimal progression.
Our validation method is as follows: we compare the standard model versus the anchor model by comparing their ability to predict each patient $k$'s first ALSFRS-R score after 365 days (1 year), observed at time $x_{k,0}$, using only ALSFRS-R scores measured before 92 days (3 months). Specifically, for both models we calculate, over the 1606 patients,

$$\sqrt{MSPE} = \sqrt{\frac{1}{1606}\sum_{k=1}^{1606}\left(Y_{k,0} - \hat{Y}_{k,0}\right)^2},$$

where $Y_{k,0}$ is patient $k$'s observed score at time $x_{k,0}$ and $\hat{Y}_{k,0}$ is the corresponding model prediction. Finally, we compare the anchor model to a logit transform model.
The logit transform model is more sophisticated, yet also more difficult to calculate and interpret. It was chosen because it is one of the easier-to-understand models for bounded data. We fit the logit transform model by taking each ALSFRS-R score, dividing it by its maximum of 48, and fitting the resultant data (which is bounded between 0 and 1) with a regression model. In other words, for a given patient we fit the following model: $\mathrm{logit}\left(\frac{y_i}{48}\right) = \beta_0 + \beta_1 x_i + \epsilon_i$, where the $\epsilon_i$ are independent errors that follow $N(0, \sigma^2)$, $\beta_0$ and $\beta_1$ are the intercept and slope parameters, and $x_i$ is the time-point associated with ALSFRS-R score $y_i$. The $\sqrt{MSPE}$ of this model comes to 14.65, noticeably higher than the $\sqrt{MSPE}$ for either the standard OLS model (12.95) or the anchor model (7.78).
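For completeness, here is a sketch of how such a logit-transform fit can be implemented. The boundary handling (`eps`) is our own assumption, since scores of exactly 0 or 48 have an infinite logit and the paper does not specify how such values were treated:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit_model(x, y, max_score=48.0, eps=0.5):
    """OLS on logit(y/48); boundary scores are pulled eps inward first."""
    p = np.clip(y, eps, max_score - eps) / max_score
    b, a = np.polyfit(x, logit(p), 1)
    return a, b

def predict_logit_model(a, b, x0, max_score=48.0):
    """Back-transform: predictions always lie strictly between 0 and 48."""
    return max_score * inv_logit(a + b * x0)
```

Unlike the raw OLS prediction, the back-transformed prediction can never violate the 0-48 bounds, at the cost of a model that is harder to interpret on the original scale.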

Discussion
In this paper, we discussed a simple and computationally inexpensive technique that may improve the predictive power of linear models. This method consists of creating an additional assumed data-point, referred to as an anchor, and including it in the OLS regression. This is different from fixed-intercept regression, as it allows more weight to be put on the observed data with respect to parameter estimation.
It has been shown in this paper that including an anchor theoretically decreases prediction variance at the cost of potentially increased bias.We demonstrated how using an anchor can improve linear predictions from modelling disease progression in ALS patients.


Figure 2: The raw prediction error for the anchor and standard models. The models' mean error, as measured by $Y_k - \hat{Y}_k$, was 3.1 and 2.1 respectively, with standard deviations of 7.1 and 12.7.
Figure 3 shows how this changes the MSPE for various values of $\Gamma$, as well as the result of changing the rule to $\hat{Y}_k = Y_k^{(a)}$ if $|T_k| \ge \Gamma$ instead (meaning choose the anchor model if the difference between the model predictions is large). From Figure 3 we see that naively using the anchor model for all patients outperforms any of the $\Gamma$ and $T_k$ decision-rule hybrids for this dataset.

Figure 3: The resulting $\sqrt{MSPE}$ for various cutoffs of $\Gamma$. Note that since the MSPE is bounded below by that of the anchor model ($\sqrt{MSPE} = 7.78$), this shows that the anchor model is uniformly better than the standard linear model ($\sqrt{MSPE} = 12.95$).