A Comparative Study of the Bias Correction Methods for Differential Item Functioning Analysis in Logistic Regression with Rare Events Data

The logistic regression (LR) model for assessing differential item functioning (DIF) is highly dependent on the asymptotic sampling distributions. However, for rare events data, the maximum likelihood estimation method may be biased and the asymptotic distributions may not be reliable. In this study, the performance of the regular maximum likelihood (ML) estimation is compared with two bias correction methods including weighted logistic regression (WLR) and Firth's penalized maximum likelihood (PML) to assess DIF for imbalanced or rare events data. The power and type I error rate of the LR model for detecting DIF were investigated under different combinations of sample size, moderate and severe magnitudes of uniform DIF (DIF = 0.4 and 0.8), sample size ratio, number of items, and the imbalanced degree (τ). Indeed, as compared with WLR and for severe imbalanced degree (τ = 0.069), there were reductions of approximately 30% and 24% under DIF = 0.4 and 27% and 23% under DIF = 0.8 in the power of the PML and ML, respectively. The present study revealed that the WLR outperforms both the ML and PML estimation methods when logistic regression is used to evaluate DIF for imbalanced or rare events data.


Introduction
In psychological and educational tests, measurement invariance is a crucial assumption for comparison of mean scores across people from different cultural, racial, or demographic backgrounds. Assessing this statistical property at the item level, also known as differential item functioning (DIF), is an important part of the process of validating tests. In general, DIF analysis is used to distinguish whether the probability of responding to a specific item on a multi-item scale differs between two groups after controlling for the overall ability that is being measured by the a questionnaire [1]. Nowadays, different statistical methods including the logistic regression (LR) model, multiple group confirmatory factor analysis (MGCFA), and item response theory (IRT) model are available to assess the presence of DIF among subgroups of people [2,3]. e LR is a model-based approach first introduced by Swaminathan and Rogers to assess DIF for both dichotomous and polytomous item scored [4,5]. e LR model is able to control additional continuous and categorical confounders which may affect the results of DIF analysis. Furthermore, the LR model provides a number of effect size measures to quantify the magnitude of uniform and nonuniform DIF which may not be practically or clinically important [6][7][8]. Uniform DIF occurs when the difference in item response probabilities is constant across the scale. Nonuniform DIF is evident when the direction of DIF differs in different parts of the construct scale [9,10]. Previous simulation studies have shown that identification of DIF through the LR model may be affected by various factors such as sample size, sample size ratio, magnitude of DIF, scale length (the number of items), and the number of groups [11][12][13][14].
However, it should be noted that statistical inference based on the logistic regression model is highly dependent on the asymptotic properties of the maximum likelihood estimator [9]. Under the large sample situations, the sampling distribution of the maximum likelihood (ML) estimators for the logistic regression coefficients is asymptotically unbiased and normal. However, in small samples, it is well known that the ML estimations may be biased and the asymptotic properties may not hold [9,15]. Accordingly, a number of penalization techniques such as Firth's method have been introduced to correct or reduce the small sample bias of the ML estimators of the LR model [15,16]. A new simulation study has recently compared the performance of the LR model based on maximum likelihood (ML) and Firth's penalized maximum likelihood (PML) estimation methods in terms of empirical power and type I error rate to detect uniform and nonuniform DIF [9]. e results showed that, as compared with PML, the LR model based on asymptotic ML worked slightly better in terms of statistical power although the difference in performance was not practically important [9].
In addition to the small sample, rare events data is another factor that can substantially influence the asymptotic properties of the ML estimators in the LR model [15,17]. e bias of prediction, as well as bias in regression coefficients, is a potential problem that may arise from the use of standard logistic regression in the presence of rare events data. Rare events data refer to occurrences that take place much less frequently than more common events [18,19]. According to King and Zeng, ordinary logistic regression can sharply underestimate the probability of rare events [20]. Hence, they proposed weighted logistic regression (WLR) to correct bias terms in regression coefficients and predicted probability by applying a weighted likelihood function to estimate parameters under pure case-control sampling [20]. Moreover, a wide variety of penalization methods have been suggested to resolve the problem of rare events data in the LR model [15][16][17].
Although the effect of bias correction methods on the performance of the LR model for detecting DIF has been evaluated by Lee in a small sample [9], such an explanation has never been provided for rare events data. To fill this gap, in the present simulation study, the three inferential methods including WLR, PML, and ML are compared to discover what the best correction method is to achieve the adequate power and type I error rate for detecting DIF when the response variable in the LR model is imbalanced or rare. Hence, in a comprehensive simulation study, we investigate whether the statistical properties of the LR model for detecting DIF, with and without applying bias correction methods, can be influenced by the degree of rareness or imbalance data, magnitude of DIF, sample size, sample size ratio, and the length of the scale across reference and focal groups. In addition to simulation, we have also used real data to validate and compare the effectiveness of the proposed methods for detecting DIF in practice.

Logistic Regression Model for Detecting DIF.
e presence of uniform and nonuniform DIF can be tested by comparing three different logistic regression (LR) models as follows: In these models, the term θ is the observed ability of each respondent usually defined as total test score, and G is an indicator variable representing group membership (reference and focal groups, for example, male and female in gender variable). According to the abovementioned models, the presence of uniform and nonuniform DIF could be determined by comparing models 1 and 2 as well as models 2 and 3, respectively. In both cases, the difference between − 2 log likelihoods of the models is compared to a χ 2 distribution with one degree of freedom.
For nonuniform DIF items, the direction of DIF differs across latent constructs, and the effect of DIF is naturally cancelled out at the level of latent constructs [21] which is the major reason for focusing on uniform DIF in this study.

Maximum Likelihood. Consider the logistic regression model
where (y i , x ir ), x ir (i � 1, . . ., n; r � 2, . . ., k) and y i ∈ {0, 1}, denotes a sample of n observations of binary outcome variable y and the vector of independent covariates with 1 × k dimensions X i � (1, x i2 , . . . , x ik ), where x i1 � 1 is the constant and β � (β 1 , . . . , β k ) ′ . en, ML estimates β r of regression parameters, and β r are obtained by solving the following score equations: where 2 BioMed Research International Unfortunately, there is no closed-form expression for maximizing the log likelihood of β. e ML estimations can, therefore, be obtained using numerical optimization methods, which start with a guess point and repeat to improve. e Newton-Raphson method is one of the most commonly used numerical methods, which needs the Hessian matrix as follows: If v i is defined as π i (1 − π i ) and V � diag(v 1 , . . . , v n ), then the Hessian matrix can be written as e information matrix is given by

Penalized Maximum Likelihood.
e penalized maximum likelihood estimation method (PML) was originally developed by David Firth [16] in order to reduce the small sample bias of maximum likelihood estimates. For exponential family models, this method corresponds to penalization of the likelihood by Jeffreys' invariant prior [22]. us, the penalized log likelihood for logistic regression takes the following form: where |I(β)| denotes the determinant of the Fisher information matrix evaluated at β. Penalized maximum likelihood estimates for β r (r � 1, . . ., k) are involved in calculating where the h i 's represent the diagonal elements of the penalized likelihood version of the standard "hat" matrix: By this method, the first-order term is removed from the Taylor series expansion of the bias of the ML estimator, which has negligible impacts in large samples but can be severe in small samples or rare events. Now, Firth-type estimation, β, can be produced iteratively by Newton-Raphson algorithm with a starting value of where the superscript s is the sth iteration. If the penalized log likelihood assessed at β (s+1) is less than that assessed at β s , then β (s+1) is recalculated by step-halving. e PML estimations can also be created by carrying out the MLE method and splitting each main observation i into two new observations having outcomes, 1 − y i and y i , with weights h i /2 and 1 + (h i /2), respectively. e new observations change the score function for the ML method to ( erefore, the breaking of each observation into a nonresponse and a response ensures that the finite PML estimates always exist [23,24]. e standard error estimation is based on the root of the inverse diagonal elements of the Fisher information matrix, which is approximated by − (z 2 ln L * /zβ 2 ) − 1 [16].

Weighted Logistic Regression.
e weighted logistic regression (WLR) introduced by King and Zeng [20,25] is one of the bias correction methods which considers two correction steps. e first correction concerns the weights, to make up for the differences in the proportion of events in the sample and population. So, the following weighted loglikelihood should be maximized: where with y and τ that are the fraction of 1s in the sample and population, respectively, in which the above weighted loglikelihood is obtained under pure case-control sampling as follows: Suppose that the joint distribution of y and X in the sample is Yet, since X is a matrix of exploratory variables, then P s (X | y, β) � P(X | y, β). In the other words, the conditional probabilities of X in the sample and population are equal. However, the conditional probability of the population is But, BioMed Research International 3 Also, by replacing and rearranging, (15) where H and Q represent the proportions in the sample and population, respectively. e likelihood is then e log-likelihood for WLR can then be rewritten as where w i � Q i /H i . us, the likelihood function must be multiplied by the inverse of the fractions in order to obtain a consistent estimator. If the proportion of events in the population is more than that in the sample, then w i is more than one, and so the nonevent are given less weight, and vice versa [26,27]. e weighting method has two serious problems that limit its application: first, the calculation of the standard errors through the information matrix is severely biased. is problem will be solved by using White's heteroscedasticity-consistent variance matrix [20].
Second, slope coefficient is biased in rare events data. For solving the second problem, the bias in β can be estimated by the following weighted least-squares expression: If bias (β) is the bias component and β is the uncorrected coefficient, then the corrected coefficients β is β � β− bias(β), where Since (n/(n + k)) 2 < 1, then V(β) < V(β), and so both the variance and the bias are now diminished. e second correction step is to adjust the underestimation of the probabilities when using the bias-corrected coefficients in the logistic model. By subtracting a correction factor C i to the biased value of probability π i , (21) where V(β) is the variance-covariance matrix, X 0 is a vector of exploratory variables, and π i is the approximate unbiased estimator by means of a known τ. When τ is unknown or partially known, King and Zeng introduced π i � π i + C i as the approximate Bayesian estimator. In this case, the researcher may specify an upper and lower bound for the possible range of τ.
Maximum Likelihood (ML), Penalized Maximum Likelihood (PML), and rare event logistic regression were implemented using the "glm" function and the "logistf" R package, as well as the "relogit" function in the R package Zelig, respectively [28,29].

Data Generation.
In this study, an item response theory model for binary data was used to produce response data. e mathematical form of the IRT model is where p ij (θ) is the probability of correct response for individual i of item j, a j denotes the item discrimination parameter, b j is the item difficulty parameter, and θ i represents the ability level for the ith individual. In this study, b j (j � 2, 3, . . ., J) and θ parameters were simulated from the standard normal distribution, and a j (j � 2, 3, . . ., J) parameters were random samples of a uniform distribution within the interval (1, 2) [1].
In this simulation study, five factors were varied: sample size, magnitude of uniform DIF, sample size ratio, number of items, and the degree of rareness or imbalance. ree sample sizes (N � 200, 600, 1000) and three levels of sample size ratio (R � 1, 2, 3) were investigated. e sample size ratio between the focal and reference groups was set to 1 : 1 for the equal sample size conditions and 2 : 1 and 3 : 1 for the unequal sample size conditions. More specifically, we created conditions with n R /n F � 100/100, 67/133, and 50/150 for the small sample size (N � 200) and n R /n F � 300/300, 200/400, and 150/450 for the medium sample size (N � 600) and n R / n F � 500/500, 333/667, and 250/750 for the large sample size (N � 1000). Furthermore, the two measures with 5 and 15 items were simulated (I � 5, 15).
To generate rare events data, we followed similar scenarios and notations used by King and Zeng. In order to generate imbalanced or rare events data, we applied the logit model: log it(π i ) � β 0 + β 1 x i , π i � p(Y i � 1), by holding β 1 constant (β 1 � 1) while varying β 0 . In this case, the values of β 0 � − 3 and β 0 � − 2 generate outcomes with the percentages of ones which are equal to 6.9% and 15.6%, respectively. It should be noted that we generate the simulated data based on the IRTmodel, not the logistic regression. Hence, we need to show that the parameters of the logistic regression and the IRT model with binary responses are one-to-one correspondent. In order to denote that both models have similar mathematical expressions, equation (22) can be rearranged as log it(π i ) � − a j b j + a j θ i . By comparing this equation with the logit function log it(π i ) � β 0 + β 1 x i , coefficients β 0 and β 1 and the variable x i in the logistic regression model are found to be, respectively, correspondent to parameters − a j b j , a j and θ i in the IRT model (i.e. β 0 � -a j b j , β 1 � a j and x i � θ i ).
With the assumption of a j � β 1 � 1, the constant parameter (β 0 ) in the logistic regression is equivalent to the minus value of the difficulty parameter (− b j ) in the IRT model. Accordingly, in the IRT framework and when a j � 1, with the values of b j � 3 and 2, we can generate responses with 6.9% and 15.6% imbalanced degree, respectively.
Statistical power is defined by the proportion of times that DIF is correctly identified by the logistic method across replications, and the type I error rate, also referred to the false positive rate, represents the proportion of non-DIF items incorrectly flagged as having DIF in 1000 replications.
e type I error rates are averaged over all without DIF items [30]. Table 1 presents the statistical power of the LR model based on three different inferential methods (ML, PML, and WLR) and under various combinations of τ, R, N, DIF, and scale length. e major finding was that the power of the ML and PML was more affected by imbalanced degree (τ) than by WLR estimation method. Specifically, for the moderate magnitude of DIF (DIF � 0.4), regardless of the sample size and number of items, increasing imbalanced or rareness degree, τ, from 0.156 to 0.069 resulted in a reduction by approximately 38%, 37%, and 30% in the power of PML, ML, and WLR, respectively. Furthermore, for the severe magnitude of DIF (DIF � 0.8), a similar finding was observed, i.e., the power of the PML, ML, and WLR was reduced approximately by 37%, 35%, and 26%, respectively. In general, our findings indicated that the power of the three estimation methods for detecting DIF could be ordered from highest to lowest as follows: WLR ≥ ML ≥ PML. Indeed, for the moderate values of DIF (DIF � 0.4), compared with WLR, there were reductions of approximately 30% and 24% under τ � 0.069 and 22% and 16% under τ � 0.156 in the power for the PML and ML, respectively. On the other hand, when the magnitude of DIF was severe (DIF � 0.8), compared with WLR, there were reductions of approximately 27% and 23% under τ � 0.069 and 14% and 12% under τ � 0.156 in the power for the PML and ML, respectively.

Results
When the magnitude of DIF was moderate (DIF � 0.4), N � 1000, R � 1 and I � 15, the maximum power of the PML, ML, and WLR was 46%, 48%, and 53% for τ � 0.156 and 25%, 28%, and 32% for τ � 0.069, respectively. Hence, to achieve the adequate power (0.8) for detecting moderate DIF, we require a sample size of larger than 1000 (N � 1000). Furthermore, for the severe of DIF (DIF � 0.8) and τ � 0.156, the power of the PML, ML, and WLR was 75%, 78%, and 83% for I � 5 and 80%, 81%, and 87% for I � 15, respectively, when N � 600 and R � 1, 2. However, under DIF � 0.8, τ � 0.069, R � 1, and N � 1000, the power of the PML, ML, and WLR was 66%, 69%, and 77% for I � 5 and 72%, 74%, and 81% for I � 15, respectively. Table 2 reports the empirical type I error rate of the LR model under various combinations of imbalanced degree of data, sample size, sample size ratio, magnitude of DIF, and the number of items. In general, regardless of the magnitude of DIF, sample size, sample size ratio, and the number of items, the average type I error rates of the ML, PML, and WLR were 0.06, 0.05, and 0.03 for τ � 0.156 and 0.06, 0.04, and 0.01 for τ � 0.069. ese findings indicate that when the imbalanced degree, τ, increased from 0.156 to 0.069, the average type I error rate of the WLR decreased dramatically, while it was close to the 5% level for the ML and PML methods. However, there were some exceptions where the type I error rates of the ML, PML, and WLR were higher than the nominal level of 5%. When DIF � 0.8, N � 1000, I � 5, and τ � 0.156, the empirical type I error rate was 11%, 11%, and 9% for ML, 8%, 9%, and 8% for PML and 8%, 7%, and 7% for WLR, when R was equal to 1, 2, and 3, respectively. In this case, as shown in Table 2, when the number of items increased from 5 to 15, the type I error rate would be equal or less than 5%. e average power and type I error rate on measures with 5 and 15 items are depicted in Figures 1 and 2. According to Figure 1, irrespective of the number of items, the highest power in all combinations belonged to WLR, while PML and ML had relatively equivalent power. Figure 2 indicates that, in most simulation conditions, type I error rate was above or exactly at the nominal significance level of 0.05 for ML and below or exactly equal to the nominal level for the PML method (ML; range: 0.05-0.11and PML; range: 0.03-0.05). However, in WLR, type I error rate was lower than the other two methods and fluctuated between 0.02 and 0.05, from 0 to 0.01 for τ � 0.156 and τ � 0.069, respectively.

Real Data Example.
In this section, we used a real data set to validate the simulation findings. Accordingly, 387 Serbian individuals (40.3% male and 50.7% female) were selected out of a larger sample of 4192 people from eleven countries participating in a cross-cultural study [31]. e participants completed the 47-item Revised Child Anxiety and Depression Scale (RCADS). e RCADS divided into six subscales including social phobia, separation anxiety disorder, generalized anxiety, panic disorder, obsessive compulsive disorder, and major depressive disorder. All items use the same 4-point Likert-type response scales to assess the frequency of a certain symptom. We binarized the item responses after transformation of never and sometimes into "no symptom," and often and always into "symptom." For details on the RCADS as used in this BioMed Research International project, see [31]. e results of DIF analysis of the RCADS across male and female Serbian adolescents based on three inferential methods (WLR, PML, and ML) are shown in Table 3. Our findings showed that ten items with DIF and 33 items without DIF were common across the three methods. However, the results indicate that the WLR method is more sensitive than the PML and ML for detecting DIF. For example, in the separation anxiety subscale, while the WLR model identified four items with DIF, one and three items exhibited DIF according to PML Note. Ratio: sample size ratio between the reference and focal groups. n r and n f represent sample sizes in reference and focal groups, respectively. N � total sample size; N � n r + n f . τ: the fraction of 1s in the population. DIF: differential item functioning; ML: maximum likelihood; PML: penalized maximum likelihood; WLR: Weighted Logistic Regression. * Near to 0.01. Note. Ratio: sample size ratio between the reference and focal groups. n r and n f represent sample sizes in reference and focal groups, respectively. N � total sample size; N � n r + n f . τ: the fraction of 1s in the population. DIF: differential item functioning; ML: maximum likelihood; PML: penalized maximum likelihood; WLR: Weighted Logistic Regression.    and ML, respectively. Moreover, as compared with ML and PML, WLR gives smaller standard error for regression coefficient, and it can result in increased likelihood of finding DIF.

Discussion
Detecting DIF based on the logistic regression model can be challenging for binary data where events are rare or severely  imbalanced.
e present study is designed to clarify the unknown consequences of rare events data in the context of DIF analysis based on the LR model with and without applying bias correction methods. Hence, in the comprehensive simulation study, we compared the performance of the WLR, PML, and ML methods for detecting DIF with focus on imbalanced binary response data.
According to our simulation findings, of the three estimation methods for detecting DIF, the WLR seems to be the most appealing with regards to its statistical properties: its type I error rate is close to or less than the nominal 0.05 level and its power is considerably higher than the PML and ML. Moreover, in the present study, comparing the three inferential methods for the real data set confirmed the findings of the simulation. Accordingly, as compared with WLR, PML and ML were less sensitive for detecting DIF across male and female Serbian adolescents.
It is worthwhile to explain why the WLR outperforms the PML and ML estimation methods for DIF analysis. Indeed, the WLR is an extension of the small-sample bias corrections, as described by Cordeiro and McCullagh, to the weighted likelihood [32]. A previous simulation study has shown that this bias correction method reduces both the bias and the variance, thereby offering the smallest Mean Squared Error (MSE) when compared with that of PML and ML estimation methods [33]. Hence, applying the WLR in DIF analysis can lead to reduction in the standard error of the parameter estimates, which consequently increases the likelihood of finding DIF. Another advantage of the WLR is that it gives unbiased predicted probability [15,20]. However, the predicted probabilities from the WLR model may fall outside the plausible range of 0 to 1 [15].
Furthermore, our findings revealed that the logistic regression model based on the traditional ML estimation method had a slightly better performance than the PML for imbalanced or rare events data. ese findings were similar to those of Lee who reported that the ML slightly outperformed the PML for detecting DIF in small samples [9]. In this case, as described by Lee, it seems that the sacrifice in the precision cannot be compensated for the bias reduction of the PML estimates in imbalanced data. It should also be noted that while Firth's PML method reduces the bias in the estimates of regression coefficients, it introduces bias in the predicted probabilities, and this bias is not negligible for rare events data [15]. To overcome this problem, Puhr et al. proposed two modified versions of Firth's PML resulting in unbiased predicted probabilities [15]. Accordingly, the proposed methods, including Firth's logistic regression with intercept correction and Firth's logistic regression with added covariate, efficiently improve on predictions from Firth's logistic regression [15]. Since these methods were not implemented in the standard statistical software such as R, we were not able to apply them in the context of DIF analysis and compare their results with those of the WLR model. Hence, in the present study, the WLR is the most efficient method for DIF analysis under rare events data. e WLR not only reduces the bias of regression coefficients but also provides unbiased predicted probability in comparison to the PML method [15,20].
What distinguishes this study from the previous one is that we simultaneously evaluated the effect of imbalanced data and small samples (as well as large samples) on the performance of the three estimation methods for DIF analysis. In addition, one of the advantages of the present study is that it provides a guideline about the required sample size for DIF analysis with imbalanced or rare events data. Previous simulation studies have shown that the minimum sample size for DIF analysis with logistic regression should be within the range of 100 to 200 per group [34,35]. However, for moderate imbalanced data (τ � 0.156) and severe DIF (DIF � 0.8), as a general rule of thumb, we would suggest imposing a minimum of 300 respondents per group to achieve the adequate power with the WLR method. In addition, for severe imbalance rate (τ � 0.069) and DIF � 0.8, the WLR performs well to ensure the acceptable power with samples of 500 per group. To find what the minimum number of sample size is to achieve the adequate power (%80) for moderate DIF, we simulated items with DIF � 0.4, τ � 0.156 and 0.069, and sample sizes greater than 500 in each group, not reported in the results section. e findings revealed that sample sizes of at least 1000 per group for τ � 0.156 and 2000 per group for τ � 0.069 are required to detect DIF � 0.4 based on the WLR method.
In the present study, we restricted our simulation to only two bias correction methods for detecting DIF, namely, WLR and PML. Nevertheless, there are a limited number of studies that evaluated DIF based on other penalization methods such as LASSO. For example, Tutz and Schauberger applied LASSO penalization for DIF analysis which includes continuous variables [36]. Furthermore, Magis et al. proposed a new DIF detection method based on LASSO where all items were simultaneously evaluated for DIF in a single modeling approach so that multiple testing would not be a problem [37].

Conclusion
Although the logistic regression (LR) model is one the most common methods for detecting DIF, certain sampling strategies and appropriate bias correction techniques should be applied when LR is implemented on moderate or severe imbalanced data sets. In summary, our findings revealed that, as compared with ML and PML, the WLR is a more sensitive method for detecting DIF when data are imbalanced or rare. Hence, as well as its easy application to existing software, the WLR introduced by King and Zeng is strongly recommended for detecting DIF due to higher power and lower type I error rate in comparison to PML and ML inferential methods. However, in the future studies of DIF, penalized and bias correction methods should be proposed for ordinal logistic regression in the presence of rare events or small sample setting.

Data Availability
is is a simulation study, and the real data set has been provided by Dr. Dejan Stevanovic the third author of the manuscript.

Conflicts of Interest
e authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.