EFFECT OF USING BIAS-CORRECTED ESTIMATORS IN LOGISTIC REGRESSION MODEL IN SMALL SAMPLES: PROSTATE-SPECIFIC ANTIGEN (PSA) DATA

This study investigates the effect of bias-corrected estimators in analyzing real-world skewed data where categorization and transformation are necessary. It also reports a small-scale simulation study to indicate factors which can influence the bias correction to be small or large. For the complete data-set, it is observed that the maximum likelihood estimates and Schaefer’s bias-corrected estimates are not greatly different. However, when the original sample size is reduced by about 50%, the difference between the estimates is found to be much larger, possibly even large enough to influence the conclusions drawn. The impact of transformation and categorization is visibly present. However, the broad impression gained in categorization is the same though difference in types of categorizations can not be overlooked. A factor which seems to influence the size of the bias correction is identified.


INTRODUCTION
The popular method used to estimate the parameters of a logistic regression (LR) model is the maximum likelihood (ML) method. The ML estimates are asymptotically unbiased. However, for small samples, these estimates have substantial bias and can thereby lead to incorrect conclusions concerning the effects of individual explanatory variables. Bias correction procedures are available in the literature (e.g., Schaefer (1983)). However, these bias corrections have found little use in practice which leads to some obvious questions. How much do these corrections affect the analysis of real-world data? If it is little, what is the justification behind advocating these corrections? Moreover, researchers do choose to categorize the continuous explanatory variable. One reason for this is to see how the logistic transform of the response probability varies over the levels of categories of an inherently continuous variable. Furthermore, for variables having substantial skewness, it is conventional to transform the data (e.g., using the log or square root transformation).
This study aims to search for answers to the questions raised above. In doing so, we use a real-world skewed data set (see section 2 for description) obtained from the Tibblin et al. (1995) study. The explanatory variable in the data set is considered in its original continuous form, in logarithmic form and in categorized form of different types. The data is used in its full and 50% reduced size to see the effect of the sample size . We use Schaefer's bias correction and the maximum likelihood estimates for the LR model parameters. To see which factors can influence the bias to be small or large, small-scale simulation experiments are included.
The limitations of such a study should be stressed. From a single real-world application, we can never obtain general results. What we can get is information on the size of differences in a situation when we know what the parameters stand for. This makes it easier to judge whether differences between methods are large or not, something which is not always easy to do in the artificial setting of a simulation experiment. A further application of the different methods to other real-world examples in combination with theoretical and simulation results should in due course give additional information so that it will be possible to judge whether the bias-corrected estimator has to be seriously considered in applied work.
Section 3 introduces Schaefer's bias correction while section 4 is devoted to a description of the categorization of continuous explanatory variable. In section 5, we present the results with discussion. Finally, we conclude in section 6. Tibblin et al. (1995) present the results of a study which was performed to elucidate the performance of prostate-specific antigen (PSA) as a screening test (PSA, a blood test, is specific for the prostate, but not for clinically significant cancer). To investigate whether an increased PSA level predicts the subsequent occurrence of a clinical cancer and how long the clinical diagnosis can be advanced by PSA testing (the lead time), they performed a case-control study nested within a population-based cohort of men with 11 years of follow-up. The study population consisted of all men born in 1913 and alive in 1980 (at the age of 67) in Gothenberg, Sweden. The actual cohort, the men of 1913, consisted of all men meeting these criteria, who were born on a date divisible by 3. Out of 921 sampled individuals, 707 men participated in the study. The cases were the men who developed cancer during 1981-1992. Sera for 36 subjects who developed cancer were considered in the study. For each case, two individually matched controls were randomly selected from those still alive at the time of diagnosis of the case without themselves having had prostate cancer prior to that date. Because serum was lacking for some subjects, the final analysis included 36 cases and 68 control subjects. For further information about the study see Tibblin el al. (1995) and references therein. Some summary data are presented in Table 1. Further summary statistics for PSA and log PSA (LPSA) are presented in Table 2. It is clear that the log transformed data is much more symmetric than the original strongly skewed data. The standard deviation of the original PSA variable is almost three times as large as the mean value, while for LPSA the standard deviation is slightly smaller than the mean value.  3.59 (1.07) TC refers to data in categorized data as used in Tibblin et al. (1995) The basic study was a matched case-control study. Tibblin et al. (1995) present the results of conditional maximum likelihood method (for the LR model parameters β i ); however, we show the results (Table 3) for both conditional and unconditional ML. It is clear that there is no substantial difference between the conditional and unconditional ML estimates irrespective of whether we use original, log transformed or categorized data. Therefore, in the following, we will not retain the matching and will consider several different estimates based on the standard logistic regression model.

SCHAEFER'S BIAS CORRECTION
Let Y i ∈ {0, 1} denote a dichotomous dependent variable, and let x i denote a k + 1 dimensional vector of explanatory variables, for the ith observation. The probability that Y i = 1, given the value of x i , is assumed to be P(Y i = 1) = π(x i ) and is defined by the logistic regression model as variables. The logit transformation in terms of π(x) is given by (1) The ML estimate of β can be obtained from the likelihood equation where X is an nx(k + 1) matrix of explanatory variables, Y is an nx1 vector of values of the dependent variable, π is an nx1 vector of π i 's and 0 is a (k + 1)x1 vector of zero's. The likelihood equation is non-linear in β 0 , β 1 , ......., β k and is solved by suitable iterative methods.
The Schaefer (1983) bias correction (SBC) formula is given by where β ML is a (k + 1)x1 vector of the ML estimates and Here X is the nx(k + 1) matrix of explanatory variables, x j T is the jth row of X, and V is an nxn diagonal matrix of the variances of π j , π j (1-π j ) that is V = diag{π j (1-π j )}. The term in the brackets represents an nx1 vector with the jth element (1-2π j )x j T (X T VX) -1 x j .
Using ML estimate π j in (4), an estimate of the bias is obtained and can be used in (3).

CATEGORIZATION OF CONTINUOUS EXPLANATORY VARIABLE
If we do not have any prior knowledge about the data (such as, for blood pressure data, it is roughly known that certain values represent dangerously low or dangerously high values), a common procedure is to categorize the data as quantiles.
In categorizing continuous data, the first job is to estimate the i th quantile Q i , i = 1, 2, ....., k. We define the design variables (Table 4) for k = 4 where the quartiles are based on all observations. However, the Tibblin et al. categorization is based on certain natural cut-offs which is also an approximate quartile categorization using only control data.
In our case, Y is the cases and controls with Y = 1 for cases and Y = 0 for controls, and X is the PSA. The PSA is also considered in its logarithmic form (i.e., X is the LPSA) and in categorized form where X i s are D i s, i = 2, 3, 4. The following logistic regression models are used log π log π The design variable D 0 is used to perform what is often called a trend-test for the variable in categorized form using

RESULTS AND DISCUSSION
In order to analyze the data, maximum likelihood estimate (MLE) and Schaefer's bias correction estimate (SBCE) were used. To test the significance of the regression coefficients in the continuous variable case (or in its logarithmic form), the likelihood ratio test (LRT), the score test (SCT) and the Wald test (WALD) were used. However, only LRT was used to test the overall significance in categorized and linear trend cases. Results are presented in Table 5 through Table 9. For PSA the odds ratio (OR) associated with a unit change is 1.48 according to the ML method, while the corresponding figure for LPSA is 6.31 (regression coefficients estimated from the LR model are the logarithm of OR). The corresponding figures for the bias corrected method are 1.45 and 5.89. Compared with the lowest category of PSA values, doubtful and non-significant risks are seen for values less than 4.0. In contrast, a PSA value greater than 4.0 is associated with a more than 30-fold increased risk of developing cancer.
For the data set with PSA, the observed absolute difference between MLE and SBCE is found to be 0.0482 (0.0161) for β 0 (β 1 ). However, for LPSA a much larger difference, 0.0733 (0.0688) for β 0 (β 1 ), is observed. The relative difference is rather similar, however (4.1 and 3.7% for the slope). These results are not exactly comparable to the simulation study results of Schaefer (1983) and Matin (1994). However, for the sample size 100, Matin (1994) obtained an absolute difference between mean MLE and mean SBCE of 0.0407 (0.0830) for β 0 (β 1 ) for a model with β 0 = -2 and β 1 = -2.
To show the effect of sample size, the analysis (Table 7) is based on the diagnosis years 1981-86, which comprise 51 observations, and 1987-91, which comprise 53 observations. Thus we reduce the sample size by almost 50%. The difference between MLE and SBCE is now much larger than in the cases previously considered with 104 observations. This result is in agreement with Schaefer (1983) and Matin (1994Matin ( , 2005 where in small samples the difference between MLE and SBCE is found to be larger. Compared with the original skewed data, a much larger difference between the estimates is obseved for LPSA. The relative difference is rather similar and markedly larger than for the complete data set (as expected).   (Table  7). This may be a difference of some importance in practice.
Regarding the significance of the regression coefficient (i.e., to test H 0 : β 1 = 0) all the test statistics in the continuous variable case show significant results. For PSA the test statistics show the relationship SCT < WALD < LRT, but for LPSA the relationship is WALD < SCT < LRT (Table 8). Matin (1998Matin ( , 2005 found the later relationship true for the mean values of the test statistics although that does not hold in every sample. If LPSA rather than PSA is used in univariate analysis, all the test statistics have larger values (Table 8). This supports the logarithmic transformation of the PSA data.
For the Tibblin et al. categorized data, with a sample of size 104, the difference between MLE and SBCE is virtually nil except for the fourth category where the difference is found to be only 0.0140 (Table 6). Similar comments can be drawn for the quartile categorized data. These results are not surprising at all, since Schaefer's study (1983) confirms that as the number of explanatory variables (continuous) increases with sample size, the bias reduction becomes negligible. For the quartile categorized data, the difference between ORs (computed from MLE and SBCE of the fourth category) is found to be approximately 5 units ( Table 6). As the ORs are large, it is doubtful if this difference is of any practical importance. A strongly significant overall effect is obtained for both categorizations of the data.    (Table 6) in all other categories when compared with the first. The risk increases are not significant in the second and third categories. The very high fourth quartile OR on the other hand is highly significant. With QC there is no difference between the risk in the first two categories. The third and fourth category risks on the other hand are larger and significant even in the third category case according to the QC. Obviously, the categorization has effect on the results obtained although the broad impression gained is the same.
Categorizing the data on the basis of all observations as in QC or on the basis of the controls only (which is what is approximately done in TC) produces quite different results in a case such as the present one with a large risk associated with the explanatory variable. The large difference between the two categorizations can be seen already from the number of observations in the different categories, a very even distribution of 25-27 in the case of QC, but a much larger variation (21,18, 20 and 45) in the case of TC. A consequence of this is the larger OR obtained in the former case for the fourth category compared with the first one.

Factors which effect the bias correction
Can we say anything about when the bias is large and when it is small given a certain sample size? The term 1 -2π j in (4) is equal to 0 if π j = 0.5, if π j > 0.5 1 -2π j assumes negative value, if π j < 0.5 1 -2π j assumes positive value and 1 -2π j ∈ (−1 1). For a particular j, when π j = 0.5, the term in curly brackets becomes 0 and contributes nothing to the whole term in (4). However, as the value of 1 -2π j moves away from zero its contribution increases according to the sign of 1 -2π j . To see how this term can influence the bias correction, we conducted a small-scale simulation study. In doing so, the original 104 PSA values were considered fixed and the values of Y were generated conditional on PSA as in equation (5), whereby Y = 1 with probability [1 + e -(β 0 + β 1 PSA) ] -1 , and = 0 otherwise. The parameter pair β 0 = -2.6271, β 1 = 0.3897 (given in Table 5 as the ML estimates of LR model parameters) was used to compute the probability mentioned above. Then the LR model (5) parameters were estimated with these new cases and controls and the π j 's were computed. This experiment was repeated five times. The same procedure was applied to PSA based on the diagnosis year 1981-86 with 51 observations and the parameter pair β 0 = -2.7441, β 1 = 0.3388 (given in Table 7 as the ML estimates of LR model parameters). Results are presented in Table 9. To facilitate the comparison of these results, the difference between MLE and SBCE is included in the last two columns of Table 9.
With n = 51, the differences are found to be larger for the sample numbers 1 and 3, which are accompanied by a smaller mean value of 1 -2π j . The inter-quartile difference of 1 -2π j is larger for the same sample numbers. With n = 104, sample number 5 provides similar results. Furthermore we know that the logistic function is essentially linear for π j ∈ (0.20 0.80), but outside this interval it becomes markedly non-linear (Collett, 1991). The case where π j ∈ (0.25 0.75) for 50% of the indices j, 1 ≤ j ≤ n is more favourable for ML estimates (Duffy and Santner, 1988). Now as a clue, if the value of π j ranges on both sides of 0.5 symmetrically, then each positive contribution of the term 1 -2π j is counterbalanced by a negative contribution at least for the linear components of the logistic function. Hence, almost no contribution needs to be added to the whole term in (4). Also, the closer the π j values are to 0.5, the smaller the contribution to (4). Smaller contributions are more likely in large samples than in small ones, because cancellation of each positive contribution by a negative one is more likely due to the more symmetric nature expected in large samples.

CONCLUSION
With the explanatory variable in its original continuous form, a small difference between MLE and SBCE is observed. However, the conclusions are not affected by the method used. When the original sample size (104) is reduced by almost 50%, the estimated difference is larger. The differences are possibly so large that they may influence the conclusions drawn. The log transformed data is much more symmetric than the original skewed one. Compared with the skewed data, a much larger absolute difference between the estimates is observed for the log transformed data, although the relative difference is not large. For the original sample size, the ORs obtained are very similar. However, in practice, for small samples the difference between ORs is found to be of some importance. The categorization has an obvious effect on the results obtained. However, no real difference is observed between the estimates of the two methods, the fourth category serving as the only exception. That is, the broad impression gained is the same although the large difference between the two types of categorizations can not be overlooked. It is observed (from the small-scale simulation study) that a smaller mean value of the term 1 -2π j is accompanied by a larger difference between the estimates of the two methods used in the study. Thus, the term 1 -2π j can influence the bias correction to be small or large. The closer the π j values are to 0.5, the lesser will be the contribution to the bias correction.