Sample size issues in multilevel logistic regression models

Educational researchers, psychologists, social, epidemiological and medical scientists are often dealing with multilevel data. Sometimes, the response variable in multilevel data is categorical in nature and needs to be analyzed through Multilevel Logistic Regression Models. The main theme of this paper is to provide guidelines for the analysts to select an appropriate sample size while fitting multilevel logistic regression models for different threshold parameters and different estimation methods. Simulation studies have been performed to obtain optimum sample size for Penalized Quasi-likelihood (PQL) and Maximum Likelihood (ML) Methods of estimation. Our results suggest that Maximum Likelihood Method performs better than Penalized Quasi-likelihood Method and requires relatively small sample under chosen conditions. To achieve sufficient accuracy of fixed and random effects under ML method, we established ‘‘50/50” and ‘‘120/50” rule respectively. On the basis our findings, a ‘‘50/60” and ‘‘120/70” rules under PQL method of estimation have also been recommended.


Introduction
Individuals, who are drawn from a hospital, school or a classroom, tend to share more homogeneity as compared to those drawn from a population which is very large in size. As such individuals will always enjoy various common properties like family background, morals and values, religion, socio-economic status, demographic, etc., complete independence of observations in such situations is never going to happen [1]. If we have nested data or multilevel data, the assumption of independence will be clearly violated and the application of analysis of variance (ANOVA) and linear regression will be incorrect because these two classical models assume independence, so substitute statistical models (Multilevel Models) needed to examine and analyze such nested data [2]. Most of the time the data is in the form of multilevel data structure like in hospitals and educational institutions, and for this type of data researchers frequently used statistical models called multilevel models, hierarchical models, mixed effects models [3], [4], which are gaining recognition very rapidly. For the last 10 years, these models have become much admired and still are on rise in terms of popularity among researchers in various fields. As one of the prime questions in any field of research is to decide about an a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 appropriate sample size, the decision and issues regarding sample size are not very straight forward in multilevel modeling. Therefore, for a quantitative study, the decision about the optimum sample size can be extremely tricky due to estimation complexity of the models and the size of the sample at each level. The issues of sample size in multilevel models have been discussed by various researchers for continuous response variable. According to [5] for a model having fixed coefficients, the group size of less than 10 is enough. However, for random coefficients a group size of � 10 is needed. [6] Concluded that to get high power and accuracy one should use more level-2 units than level-1 units. Similarly, for level 2 effects and cross-level interactions the power of the test mainly depends on level 2 units. [7] carried out simulation study regarding sample size issues in multilevel models for a continuous response variable and they determined that for fixed effects 10 groups are sufficient, for contextual effects 30 groups are essential and for valid estimation of standard errors 50 groups are required. Similarly, Maas and Hox [8] carried out another simulation study for a continuous response variable by taking three groups (30, 50,100), three group sizes (5,30,50) and Intra Class Correlation i.e. ICC (0.1, 0.2, 0.3). They concluded that across all simulated conditions, the estimates were unbiased and reported under estimation of level 2 variance components when number of groups was below 100. It was also concluded that for better estimation at least a sample of size 100 is needed for level 2. However, there is fewer research conducted in the context of binary response variable. Moineddin et al., [9] performed simulation study for the determination of sample size for multilevel binary logistic regression model with single level-1 explanatory variable and single level-2 explanatory variable and by taking three groups conditions(30,50,100), three group sizes (5,30,50) and ICC (0.04,0.17,0.38). They came to the conclusions that when the number group is equal to hundred with a group size of fifty or more, the fixed effect parameters estimates were unbiased. Secondly, when the number group is equal to hundred with a group size of fifty, the variance components were reported to have a bias. The amount of bias was extremely high for the random effects as well as for the fixed effect when the group size was five. The standard errors for the variance components were underestimated and for fixed effect parameters they were unbiased. Paccagnella [10] used a multilevel binary logistic random intercept model in the simulation study and explored that similar to continuous response variable model, the bias of fixed parameter estimates decreased with increase in number of groups. Acceptable coverage rates were achieved for fixed effects estimates when number of groups was 50. Unlike continuous outcome variable models, a very large number of groups is needed to achieve acceptable coverage rates for the variance components estimates.
Zeng [11] proposed a Bayesian spatial generalized ordered logit model to analyze freeway crash severity. The suggested model was superior as compared to the traditional generalized ordered logit model in terms of statistical significance of the spatial term and better model fit. Similarly, to analyze crash rate by injury severity, three temporal multivariate random parameters Tobit models were developed by Zeng [12]. In all of the temporal models, significant temporal effects are found and the goodness of fit (Bayesian R 2 ) of the multivariate random parameters, Tobit regression improves considerably due to the inclusion of temporal correlation. The inclusion of spatio-temporal correlation and interaction in a multivariate randomparameters Tobit model and their influence on fitting arial crash rates with different severity outcomes have been investigated by [13] in the Bayesian context. The proposed model performs better in terms of model fit than a multivariate random-parameters Tobit model and a multivariate random parameters spatial Tobit model.
A driving simulator experiment was conducted by [14] to investigate the safety of the truck under crosswind at the bridge-tunnel section. Steering angle and the yawing rate were the indices of the dynamic response under crosswinds. To prevent the possible accident, the authors recommended various safety options. In another study, [15] used Mixed logit models to reveal random effects. This was the first ever investigation of the difference in driver-injury severity between single vehicle (SV) and multi-vehicle accidents (MV). Respective critical risk factors of SV and MV accidents were evaluated and compared. Comprehensive observations, which have not been covered in the existing studies, were made. Additionally, to examine factors affecting injury sustained by two drivers involved in the same rear-end crash between passenger cars, a random parameters bivariate ordered probit model has been developed by Chen [16]. The proposed model outperforms the two separate ordered probit models with fixed parameters.
In multilevel models small group sizes such as 5, 10, and 15 and 20 are usually considered in education, behavioral science, etc. But here, large group number and moderate group sizes have been utilized. As compared to the linear multilevel models, larger group numbers are needed for multilevel logistic regression models. That is why small group number has been ignored in this study. Moreover, Bayesian methods may also be very useful in such situations, and can reduce model misspecification and estimation bias significantly.
So far, very little research has been conducted in the literature regarding sample size determination in the context of multilevel logistic regression models. For example, there is nothing in the contemporary research about PQL method in multilevel logistic regression models. Therefore, the present study attempts to capitalize on a novel state-of-the-art rule that encompasses all the weaknesses of the available methods for both fixed and random effects estimates under ML and PQL methods of estimation. In addition, the present study also provides guidelines about optimum sample size needed for multilevel logistic regression models. Random intercept and random slope model with two level-1 and one level-2 explanatory variables using threshold parameter concept are used. Further, larger random effects are incorporated in the present study which were unnoticed in the previous literature. Moreover, relevant factor and their levels were also ignored in [9] and [10]. A detailed comparison of ML method and PQL method has been made in terms of sample size.

Multilevel Logistic Regression Model:
A very popular concept is used in social sciences to develop a dichotomous multilevel logistic model through a latent continuous variable model [17]. A threshold concept is used that the latent continuous variable Y � ij underlies the observed variable Y ij . A simple two level dichotomous model is The combined model was obtained by substituting level 2 model in level 1 model: Where X 1ij and X 2ij are the Level-1 explanatory variables, W j is the Level-2 explanatory variable, Level 1 coefficients denoted by β and γ 0 s are the fixed effects. If e ij * logistic (0,π 2 /3), the model is, then, referred to as multilevel logistic model [18]. Random effects of level 2 assumed to have a multivariate normal distribution It should be noted that the equation for Intra Class Correlation (ICC) is According to [19], ICC is an estimate of the total variance explained by the grouping structure. Now Y � ij can be linked with the observed variable Y ij through a threshold γ, and this threshold is also the intercept of the above model. As we have only two categories, so The other concept or approach towards multilevel logistic regression models is that of Multilevel Generalized Linear Models. Both approaches lead to equivalent models, but certainly different at the conceptual level.
Let Y ij be a binary response variable representing the occurrence or nonoccurrence of some characteristics having values 0 and 1, corresponding to the individual level unit (i = 1,2. . .. . .nj, j = 1,2. . .. . .N) and i is nested in j. A multilevel dichotomous logistic model with two level 1 explanatory variables and single level 2 explanatory variable can be written as The combined model was obtained by substituting level 2 model in level 1 model: This particular model was utilized in the present study.
It should be noted that the lowest error e ij is absent in the Eq (1) because it is part of Generalized Linear Model specification [20]. This particular framework of generalized linear model is very popular in biostatistics.
It would be easier to formulate the equation as P ij = expit (regression equation) where expit is the inverse of the logit function [6]. Then for simulation studies, one would have to specify the mean and variance of all predictor variables, and the values of all regression coefficients. The predictor variables would be randomly generated, and the expit function would turn the continuous prediction into a proportion, which can then be dichotomized according to the chosen threshold. Similarly, one can use the following expression

Simulation design
The threshold parameter was set to (γ 00 = -1.22), corresponding to approximately twenty percent prevalence rate of the response variable. Similarly, the other fixed effect parameters (γ 10 , γ 20 ,γ 01 ,γ 11 ) were set on the analogy of the studies conducted by [7] and [9]. That is γ 10 = 0.3, γ 20 = 0.3, γ 01 = 0.3, γ 11 = 0.3. The explanatory variables (X 1ij ,X 2ij and W j ) were all generated from standard normal distribution. u 0j � Nð0; s 2 u Þ and u 1j � Nð0; s 2 1 Þ, where s 2 u follows from intra-class correlation specification. The s 2 1 value was set to 1 in all simulation scenarios and for simplicity, the term σ u1 was set to zero.
Four scenario for the number of group's factor and three each for group size and ICC were used. The number of Groups were taken as (30, 50,100 and 120), Group sizes were (5, 30 and 50) and ICC were set to be (0.1, 0.2 and 0.4). It means, we have 4×3×3 = 36 scenarios and for each one, the number of simulations "R" was set to be 1000.

Analysis
The accuracy of different fixed effect and random effect parameters estimates were calculated through the relative bias = (estimate-parameter)/parameter. Empirical coverage rates of 95% confidence intervals were used to judge the accuracy of the standard errors of estimated parameters. The 95% confidence intervals coverage rates were computed in each scenario as the proportion of replications in which the true parameter is captured by the 95% confidence interval. Bradley recommended acceptable coverage rates as 92.5% to 97.5% [21]. Empirical powers were computed for the X 1ij X 2ij , W j , and X 1ij ×W j . The power was calculated as the number of replications in which H 0 of null effect was correctly rejected at 5 percent level of significance divided by 1000 as 1000 replications was used for each scenario. Moreover, a separate logistic regression was used to judge the influence of various simulation scenarios on estimates empirical coverage rates.

Average relative bias of the Multilevel Binary Logistic Model Fixed Effects and Random Effects
Estimates across all conditions under ML method of estimation is presented in Figs 1-7. Estimates have negligible bias when the number of groups is large. Figs 1-7 indicate that the bias reduces significantly with the group sizes and the number of groups for all the estimates. Estimates were substantially biased in conditions when the number of groups was 30, group size was 5 and random effects had their smallest values. The relative bias was generally less than 5% when the number of groups was 50. For threshold estimate, the relative bias was negative, and for rest of the fixed effect estimates, the relative bias was positive. Figs 1-7 show that the bias reduces significantly with the number of groups for all the estimates. The group size factor has minimal impact on estimates average relative biases. Table 1 reflects the influence of the number of groups on multilevel binary logistic model estimates empirical coverage rates under ML method of estimation. This actually indicates a significant effect of the number of groups on the accuracy of estimates standard errors. The largest non-coverage for threshold parameter estimate was 5.8% when the number of groups was 30. Similarly, for γ 10 , γ 20 , γ 01 and γ 11 the largest non-coverage rates were      5.5%,5.6%,6.1%,6.3% respectively. Furthermore, for σ u and σ 1 the largest non-coverage rates were 11.2% and 10.6% respectively. The non-coverage rates decreased significantly with increasing the number of groups. The influence of the number of groups was significant on the empirical coverage rate for both fixed effects and random effects estimates.
Similarly, Table 2 reveals the influence of group size factor on empirical coverage rates in multilevel binary logistic model estimates under ML method of estimation. The group size factor did not play a dominant role in raising the accuracy of standard errors of the estimates. The coverage rates of fixed effects estimates were all acceptable at all group sizes. A separate logistic regression was used to judge the effect of group size levels on estimates empirical coverage rates. P-values indicates the impact of group size factor on both fixed and random effect estimate empirical coverage rates.
Moreover, Table 3 shows the influence of ICC on multilevel binary logistic model estimates empirical coverage rates under ML method of estimation. The influence of different levels of ICC was insignificant on empirical coverage rates of both fixed effects and random effects estimates when separate logistic regression was used to judge the effect of different ICC conditions on empirical coverage rates of estimates. It can be observed that the bias reduces significantly with the group sizes and the number of groups for all the estimates. Table 4 reflects the influence of the number of groups on multilevel binary logistic model fixed effects and random effects estimates empirical coverage rates under PQL method of estimation. The largest non-coverage for the threshold parameter estimate was 8.6% when the number of groups was 30 and it reached 7.2% when the number of groups was 120. Similarly, for γ 10 , γ 20 , γ 01 and γ 11 the largest non-coverage rates were 8.7%,8.7%,8.8%,9.6% respectively. Furthermore, for σ u and σ 1 the largest non-coverage rates were 13.5% and 12.6% respectively. The influence of the number of groups was insignificant in most of the conditions when separate logistic regression was used to judge the effect of the number of groups on estimates empirical coverage rates.  In the same way, Table 5 reveals the influence of group size factor on multilevel binary logistic model fixed effects and random effects estimates empirical coverage rates under PQL method of estimation. The group size factor played a dominant role in the reduction of estimates non-coverage rates. A separate logistic regression was used to judge the effect of group size factor conditions on estimates empirical coverage rates. The coverage rates were significantly affected by the group size factor. P-values indicate the impact of group size on estimates empirical coverage rates. Additionally, Table 6 highlights the influence of ICC on fixed effects and random effects estimates empirical coverage rates of the multilevel binary logistic model under PQL method of estimation. The influence of different levels of ICC was significant in most of the conditions on both fixed effects and random effects estimates empirical coverage rates when separate logistic regression was used to judge the effect of different levels of ICC on estimates empirical coverage rates. Table 7 lists the power rates for multilevel binary logistic model fixed effects estimates under ML method of estimation. The lowest power rates were recorded for all the fixed effects estimates when the number of groups was 30. The power increased substantially with the number of groups. With 100 groups, power rates were well above 0.90.With 120 groups, power rates were 100% in majority of the conditions under ML method. On the contrary, PQL fixed effects estimates power rates were lower than that of ML fixed effects estimates power rates, on average. Power rates increased with the number of groups under both methods of estimation. Table 8 lists the power rates for multilevel binary logistic model fixed effects estimates under PQL method of estimation.

Conclusions
In ML method of estimation, the fixed effects estimates were unbiased even with 30 groups; however, the accuracy of the standard errors of fixed effects estimates was achieved when the number of groups was 50. In addition, random effects estimates were underestimated, particularly with 30 groups. Unlike fixed effects estimates standard errors, the accuracy of the random  effects estimates standard errors was achieved when the number of groups was 120. Overall, the influence of the number of groups was significant on the accuracy of multilevel binary logistic model estimates and their standard errors. However, group sizes effect was insignificant in most of the conditions on the accuracy of estimates and estimates standard errors. The present study not only confirms (50/50 rule, i.e. minimum of 50 groups and 50 units per group under ML method of estimation) of Moineddin et al. [9] but also suggests that 120 groups and a group size of 50 is mandatory for obtaining of sufficient accuracy of random effects when prevalence of the outcome is around 20 percent. Additionally, the influence of the number of groups was substantial on empirical power rates of fixed effects estimates under ML method of estimation. The power rates for all the fixed effects estimates increased with an increase in the number of groups. The results obtained in this study are parallel to the previous studies when the response variable is continuous [22][23].
On the other hand, the fixed effects estimates had the largest biases when group size was at the lowest, i.e. 5 under PQL method of estimation. However, their biases were negligible when group size was 50. Unlike the ML method of estimation, the group size was the most significant factor that influenced the accuracy of multilevel binary logistic model estimates and estimates standard errors. Furthermore, the accuracy of the fixed effects estimates standard errors    accuracy. In addition, it is also recommended that 120 groups and a group size of 70 may be used to achieve sufficient accuracy for the variance components estimates and their standard errors when prevalence of the outcome is around 20 percent. Larger ICC values also decreased the accuracy of estimates and their standard errors. Similarly, like ML method of estimation, the power rates for the multilevel binary logistic regression model fixed parameter estimates increased with the number of groups. The power rates of PQL method of estimation was also on the lower side as compared to ML method power rates. Across all conditions, PQL method estimates and estimates standard errors of multilevel binary logistic model are not comparable to that of ML method. On the basis of the present study results, it is, therefore, recommended that PQL method for binary outcome variable may be avoided in situations such as low prevalence of outcome, larger values of random effects and even when group sizes are 50 or less. However, the significance of Penalized Quasi Likelihood method of estimation was ignored earlier which proved to be an extremely effective method when random effects are small.