Separating interviewer and area effects by using a cross‐classified multilevel logistic model: simulation findings and implications for survey designs

Cross‐classified multilevel models deal with data pertaining to two different non‐hierarchical classifications. It is unclear how much interpenetration is needed for a cross‐classified multilevel model to work well and to estimate the two higher‐level effects reliably. The paper investigates this question and the properties of cross‐classified multilevel logistic models under various survey conditions. The effects of different membership allocation schemes, total sample sizes, group sizes, number of groups, overall rates of response and the variance partitioning coefficient on the properties of the estimators and the power of the Wald test are considered. The work is motivated by an application to separate area and interviewer effects on survey non‐response which are often confounded. The results indicate that limited interviewer dispersion (around three areas per interviewer) provides sufficient interpenetration for good estimator properties. Further dispersion yields only very small or negligible gains in the properties. Interviewer dispersion also acts as a moderating factor on the effect of the other simulation factors (sample size, the ratio of interviewers to areas, the overall probability and the variance values) on the properties of the estimators and test statistics. The results also indicate that a higher number of interviewers for a set number of areas and a set total sample size improves these properties.


Introduction
In face-to-face surveys interviewers play a crucial role in gaining responses from sample members (Blom et al., 2010; Durrant and Steele, 2009; Durrant et al., 2010; Campanelli and O'Muircheartaigh, 1999; Hox and De Leeuw, 2002; Pickery and Loosveldt, 2002; Pickery et al., 2001; Haunberger, 2010). Interviewer characteristics that have been found to be significant in explaining response include age (Blom et al., 2010; Hox and De Leeuw, 2002), gender (Hox and De Leeuw, 2002; Hansen, 2006), interviewers' education (Durrant et al., 2010; Haunberger, 2010), pay grade and years of experience (Durrant et al., 2010; Hox and De Leeuw, 2002; Hansen, 2006) and attitudes regarding the persuasion of reluctant respondents (Blom et al., 2010; Durrant et al., 2010). Area effects on non-response may arise from environmental factors, such as physical accessibility and urbanicity (Haunberger, 2010), as well as from similarities in the socio-demographic, socio-economic and cultural characteristics of households, and in perceptions of privacy, crime and safety. To some extent area effects may therefore be considered to be aggregated household effects. In fact, studies by Campanelli and O'Muircheartaigh (1999) and Durrant et al. (2010) show that area random effects are not significant after controlling for fixed household level effects in a cross-classified model. Significant area effects in the literature include an indicator of urbanicity or rurality (Blom et al., 2010; Durrant et al., 2010), region (Haunberger, 2010) and the proportion of the non-white population in the area (Campanelli et al., 1997).
Since interviewers usually work in a restricted geographical area, any interviewer effect that is identified could simply reflect area differences in the propensity to co-operate with survey requests. A particular estimation problem therefore concerns the identifiability of area and interviewer variation. In a randomized experiment an interpenetrated sample design would be employed, where each sampled case is allocated randomly to interviewers irrespective of area. This is considered the gold standard for separating interviewer effects from area effects in face-to-face surveys but is not implemented in survey practice owing to restrictions in field administration capabilities and survey costs (Schnell and Kreuter, 2005; Campanelli and O'Muircheartaigh, 1999). A compromise which is achievable in a real survey setting is partial interpenetration. Partial interpenetration exists where interviewers are not fully nested within areas: one interviewer may work in more than one area, and sampled cases in one area may be assigned to more than one interviewer. In the case of partial interpenetration a cross-classified multilevel model specification which includes both interviewer and area random terms has been suggested to distinguish between the two sources of variation (Von Sanden, 2004). Although a range of references have used such models to distinguish between area and interviewer effects (Durrant et al., 2010; Schnell and Kreuter, 2005), it is unclear how much interpenetration may be needed for a cross-classified multilevel model to work well and to estimate interviewer and area effects reliably. In particular, in circumstances of small sample sizes and low degrees of interpenetration in the dispersion of interviewers across areas, problems of biased estimates and low power for tests of significance may arise.
Some previous studies (Maas and Hox, 2005; Moineddin et al., 2007; Paccagnella, 2011; Rodriguez and Goldman, 1995; Theall et al., 2011) have looked at the properties of estimators and the power of significance tests for two-level models. However, questions regarding how well cross-classified multilevel model parameters can be estimated have not yet been explored.
This study examines the implications of various practical limitations in the assignment of cases from different areas to interviewers within a range of scenarios through a simulation study. These scenarios include different total sample sizes, group sizes (interviewer case-load), numbers of groups (numbers of interviewers), overall rates of response and the percentage of variance that is attributable to area and interviewer effects. Interviewer-area classifications are restricted to possible interviewer work allocations, and selected values for the other factors represent realistic values, making the simulation results relevant to survey practice. The implications are assessed in terms of bias, confidence interval coverage, correlation of the two variance estimators and power of significance tests. The study also examines the smallest interviewer pool and the most geographically restrictive and cost-effective interviewer case allocation that are required for acceptable levels of bias and power in typical survey scenarios. By suggesting minimal sample sizes and interviewer dispersal patterns to guide survey design and administration, and by shedding light on the accuracy and precision of the estimates and the power of their tests of significance in multilevel modelling, this study contributes to two areas of research: study design and parameter estimation (Paccagnella, 2011).
Although the factor conditions and the application that are considered here are specific to survey design and the exploration of interviewer effects on non-response, the same problem of identifiability may arise in other settings. For example, other survey design applications may consider the variation in the response to questionnaire items that is attributable to interviewers, with the aim of quantifying any interviewer influence on responses (measurement error).
The remainder of the paper is structured as follows. First a review of multilevel models and their mathematical properties is presented, followed by an explanation of the use of multilevel models for the analysis of interviewer effects on non-response. Then a review of previous work exploring the properties of cross-classified and two-level hierarchical models is given. Section 3 presents the details of the design of the simulation study and the analysis carried out. Next the results are presented, followed by a discussion and conclusions.

Model specification
The independent errors assumption in standard regression analysis is often not valid for social science data. Individual observations which pertain to a common higher-level grouping, such as a school, family, neighbourhood or work organization, may have similarities arising from the common context, which give rise to dependence between observations. Multilevel modelling extends the error term of standard regression analysis to adjust for such dependences (Goldstein, 2011). Consequently, multilevel models allow the variation in the outcome variable to be partitioned into various sources, both individual and group sources. Group similarities are considered substantively interesting rather than a violation of model assumptions which needs to be accounted for, thus allowing the exploration of significant individual and group influences, as well as any possible interactions between these two factors, on the individual level outcome of interest. As well as allowing for a detailed analysis of predictors defined at the cluster level (also called contextual effects) through the inclusion of a higher-level random effect and contextual or aggregate fixed effects, multilevel models allow for the inclusion of data at the individual level. This helps to avoid loss of information at the individual level, a smaller sample size and the risk of ecological fallacy from analysing aggregated data. Such models do not assume that all contextual effects are captured through observable predictors, as in a contextual analysis, and avoid restricting inference to the groups that are sampled in the data or including a large number of dummy variables, as in a fixed effects model. Multilevel models also offer more flexibility than other methods to account correctly for a complex structure.
The general form of a logistic multilevel model for purely hierarchical data with two levels is

logit(π_ij) = log{π_ij / (1 − π_ij)} = β_0 + β_1' X_ij + u_j.    (1)

Here π_ij is the probability that individual i in cluster j takes a value of 1 for the y-variable, where y is a dummy variable indicating whether a person experienced an event or has a particular characteristic. β_0 represents the overall intercept in the linear relationship between the log-odds of y and the predictor variables that are included in the model, X_ij, and is the log-odds for an individual pertaining to the reference categories of the categorical variables, having a value of 0 on continuous variables and belonging to the average higher-level group (a group with a value of 0 for the higher-level random effect u_j). The vector β_1 contains the coefficients for each explanatory variable in the model when all other predictor variables are controlled for. These coefficients are also known as the cluster-specific effects of the explanatory variables, since they represent the effect of a unit increase in the covariate on the log-odds that the individual has a value of 1 on the outcome variable y, for a constant value of u_j and therefore within the same higher-level group j. The vector X_ij represents the predictor variables, which may be defined at the individual or cluster level. The predictor variables may also include interaction effects or cross-level interaction effects. The u_j represent the random effects for the higher-level classification units, which are assumed to follow a normal distribution with mean 0 and variance σ²_u. Besides purely hierarchical structures, multilevel models can also deal with data pertaining to two different non-hierarchical classifications (cross-classifications) (Fielding and Goldstein, 2006). The general form of such a cross-classified multilevel logistic model is

logit(π_i(js)) = β_0 + β_1' X_i(js) + u_j + v_s.    (2)

Here π_i(js) is the probability that individual i in clusters j and s takes a value of 1 for the y-variable. The parameters u_j and v_s represent the random effects for each higher-level classification and are assumed to follow independent normal distributions with mean 0 and variances σ²_u and σ²_v respectively.
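As context for the variance values used later, the share of the latent-scale variance attributable to each classification (the variance partitioning coefficient) can be computed by fixing the level 1 residual variance at π²/3, the variance of the standard logistic distribution. The sketch below is illustrative only; the function name is not from the paper.

```python
import math

def vpc(sigma2_u, sigma2_v):
    """Latent-scale variance partitioning coefficients for a
    cross-classified logistic model: the level 1 residual variance
    is fixed at pi^2 / 3 (standard logistic distribution)."""
    level1 = math.pi ** 2 / 3.0
    total = sigma2_u + sigma2_v + level1
    return sigma2_u / total, sigma2_v / total
```

With the baseline values σ²_u = σ²_v = 0.3 used in the simulations, each classification accounts for roughly 8% of the latent-scale variance.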

Review of properties of cross-classified models
For the case of cross-classified multilevel models, sample size requirements and the level of interpenetration between the two cross-classified higher-level classifications that is necessary for accurate parameter estimation have not yet been considered. What is currently available is a software package which produces power calculations for various sample sizes, data structures and random-effects sizes: MLPowSim (Browne and Golalizadeh, 2009). This package is limited to power calculations and cannot provide the other properties that are considered in the current paper. For cross-classified models MLPowSim carries out the estimation in R by using the lmer function. The most flexible template for cross-classified data in MLPowSim enables the user to specify the total sample size, the number of higher-level groups, the probabilities of sampled cases pertaining to each higher-level combination and the expected variances. The MLPowSim manual includes an example with examination attainment at age 16 years (a continuous variable) chosen as the outcome variable, where each student is associated with both a primary and a secondary school. For this particular application, results show that sampling additional cases (students) from new higher-level groups (schools) yields greater increases in power than sampling additional cases from higher-level groups that are already included in the sample. Also, adding further cases per higher-level grouping benefits power only up to a threshold number of cases. Several references exist (Rodriguez and Goldman, 1995; Paccagnella, 2011; Moineddin et al., 2007; Theall et al., 2011; Maas and Hox, 2005) that assess the effect of various factors, including sample size and outcome probability, on the properties of two-level hierarchical model estimates, for both continuous and binary outcome variables, through simulation studies.
By definition two-level models do not include data pertaining to two classifications, and therefore the effect of interpenetration on model estimates cannot be reviewed in these studies. The results of these studies will be presented in Section 5 when reviewing this paper's results for cross-classified models.

Estimating area and interviewer effects
The analysis of interviewer effects has become a popular application of multilevel methods (Von Sanden, 2004). Sample cases are nested within interviewers. However, interviewers generally work in a limited geographic area and, to the extent that people from certain areas are more or less likely to co-operate, significant interviewer effects may simply indicate area effects. Moreover, there may also be area effects on non-response, arising from similarities in socio-economic and cultural characteristics, in perceptions of privacy, crime and safety, as well as in environmental factors such as physical accessibility and urbanicity across geographic boundaries (Haunberger, 2010). One approach to estimating interviewer effects in the past has been simply to ignore area effects (Pickery and Loosveldt, 2002; Haunberger, 2010; Blom et al., 2010), which clearly could yield misleading results and may overstate the effect of interviewers. Few studies have attempted to disentangle interviewer and area effects by specifying a cross-classified multilevel model for multistage cluster sample design data (Durrant et al., 2010). This model specification prevents confounding only when the data are partially interpenetrated, which means that interviewers are not fully nested within areas, with interviewers working in more than one primary sampling unit, and cases in one primary sampling unit assigned to more than one interviewer. In this particular application of sample units within interviewers operating within sampling areas, the higher-level variance is divided into two parts: the interviewer level variance and the area level variance.

Design of the simulation study
First, the process by which the data in the simulation study are generated is presented. Then, the cross-classified multilevel logistic regression model which is fitted to the simulated data is presented. The various simulation scenarios and the design factor values that are considered are then specified. Finally, the measures that are used to assess the properties of the estimators and test statistics, including the rationale for considering each measure and the equations that are used for their calculation, as well as model diagnostics, are presented.

Data-generating procedure
In this simulation study the focus is on the random-parameter estimates, and therefore only an overall intercept β_0 is included as a fixed effect. Its value is determined after considering the overall probability of the outcome for the mean area and the mean interviewer, π, by using the formula

β_0 = log{π / (1 − π)}.    (3)

This value is fixed for all cases. Then a cluster-specific random effect for each interviewer and area is generated separately from a normal distribution with mean 0 and variances σ²_u and σ²_v respectively. The log-odds of each case, η_i(js), are computed by adding the overall intercept value to the simulated random effects. These values are then converted to probabilities by using the equation

p_i(js) = exp(η_i(js)) / {1 + exp(η_i(js))}.    (4)

Values of the dependent variable Y_i(js), a dichotomous outcome, with 0 signifying non-response and 1 signifying response to the survey request, are generated for each case from a Bernoulli distribution with probability p_i(js). For scenarios which vary only in the interviewer case allocation the same set of 1000 cluster-specific random effects is used. This strategy reflects the fact that, whereas interviewers are assumed to come from an infinite population, the allocation of workload from different areas to specific interviewers is limited to a finite number of possibilities. On the basis of the usual design of a two-stage clustered household survey, areas would not normally be expected to be adjacent. Hence, areas are assumed to be independent in the design of this simulation study.
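The data-generating steps above can be sketched as follows for the baseline scenario and the simplest (case 1) allocation. This is a minimal illustration under the stated factor values, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Baseline scenario from the paper: 120 areas x 48 cases, 240 interviewers
# x 24 cases, sigma2_u = sigma2_v = 0.3 and overall probability pi = 0.8.
n_areas, cases_per_area = 120, 48
n_int, cases_per_int = 240, 24
sigma2_u, sigma2_v, overall_pi = 0.3, 0.3, 0.8

# Fixed overall intercept: beta_0 = log(pi / (1 - pi)).
beta_0 = np.log(overall_pi / (1 - overall_pi))

# Case 1 allocation: two interviewers per area, each working in one area.
area = np.repeat(np.arange(n_areas), cases_per_area)
interviewer = np.repeat(np.arange(n_int), cases_per_int)

# Cluster-specific random effects for interviewers (u) and areas (v).
u = rng.normal(0.0, np.sqrt(sigma2_u), n_int)
v = rng.normal(0.0, np.sqrt(sigma2_v), n_areas)

# Log-odds, probabilities and Bernoulli outcomes.
eta = beta_0 + u[interviewer] + v[area]
p = np.exp(eta) / (1.0 + np.exp(eta))
y = rng.binomial(1, p)  # 1 = response, 0 = non-response
```

Other scenarios vary the constants at the top and the `area`/`interviewer` membership vectors, which encode the allocation scheme.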

Estimation of the multilevel cross-classified model
The following multilevel cross-classified model is then fitted to the simulated data to identify interviewer and area random effects (without covariates for simplicity):

logit(π_i(js)) = β_0 + u_j + v_s,    (5)

where the interviewer-specific residuals u_j are distributed N(0, σ²_u) and the area-specific residuals v_s are distributed N(0, σ²_v). The analyses of the simulated data sets are carried out by using Stata version 12 calling MLwiN version 2.25 through the runmlwin command (Leckie and Charlton, 2011). Models are fitted by using the Markov chain Monte Carlo estimation method with default priors and, depending on the rate of convergence, a burn-in length of between 5000 and 10 000 and between 200 000 and 500 000 iterations. Initial values for the parameters are obtained from the second-order penalized quasi-likelihood estimation method.

Simulation scenarios
To explore the properties of the estimators, a simulation experiment is carried out using a factorial design. The simulated scenarios vary in the following factors: overall sample size N; numbers of interviewers and areas, N_I and N_A, and, by consequence, the numbers of cases per interviewer and per area; the level of cross-classification between interviewer and area allocations; the higher-level variances; and the overall probability of the outcome variable, π.
The choice of the values for the various factors reflects realistic representations of general household survey scenarios. N_A in this simulation study will not be altered for a specific N. The initial numbers that are chosen for N, N_A and N_I are based on the values that are obtained from a real survey, slightly adapted to obtain numbers which are easily divisible to give balanced designs. The main design, which will be referred to as the baseline scenario design, includes 120 areas consisting of 48 cases per area allocated to 240 interviewers who each have a workload of 24, totalling 5760 cases, with the area variance σ²_v = 0.3, interviewer variance σ²_u = 0.3 and an overall probability π = 0.8. The effect of different interviewer-area classifications, varying in terms of the number of areas that each interviewer works in (and consequently the number of interviewers per area) and the overlap of interviewers working in neighbouring areas, on the properties of the estimators and test statistic for the baseline scenario factors is analysed. The number of areas that each interviewer works in will be allowed to vary from 1 to 6.
For illustration, Tables 1-3 show the area-interviewer allocations for the first six areas. The areas are considered as sequential numbers in a circle, with the final area, area 120, neighbouring the first area, area 1. Each box represents an area and the numbers within each box represent the interviewers working within that area (numbers from 1 to 240). The simplest case, case 1, is where two interviewers work in each area, with each interviewer working only in one area (Table 1). In this case, there is no overlap in neighbouring areas with respect to the interviewers working within them. This in fact represents a purely hierarchical model, with individuals nested within interviewers which in turn are nested within areas. For scenarios with an equal number of interviewers and areas these two variables are confounded.

Table 1. Area-interviewer allocations for case 1

Table 2. Area-interviewer allocations for case 2
Next, an interviewer can work in two areas, with four interviewers working in each area (Table 2). Three possible scenarios exist. The most overlap occurs for the scenario which allocates the same set of four interviewers to work in two neighbouring areas (case 2(a)). Alternatively, groups of three interviewers are repeated in two neighbouring areas, with a fourth interviewer varying between the two areas (case 2(b)). Thirdly, pairs of interviewers are always allocated together, with each particular pair never occurring twice with another pair (case 2(c)).
Similar allocation patterns are considered for schemes where the interviewer works in three or more areas (diagrams are not presented). The same basic principle of decreased overlap as one moves from allocation (a) to subsequent allocations applies. For cases where interviewers work in three areas and each area includes six different interviewers, seven different allocation possibilities are considered. With interviewers working in more areas, fewer variations of overlap are considered, simply because of the feasibility of such allocation schemes in practice. Three allocation schemes are considered for situations where each interviewer works in four, five and six areas, and cases within each area are allocated to eight, 10 and 12 different interviewers respectively. Owing to computer power limitations and to dependences between factors (for a fixed sample size, a change in the number of clusters (interviewers or areas) also changes the number of cases per cluster and the level of cross-classification between the two higher-level classifications), it was impossible to consider all factor combinations. Only one simulation factor at a time is changed, keeping all other factors constant. Table 4 outlines the baseline values as well as the other values that were considered for each factor in the simulation study.
The analysis for the initial baseline scenario design, containing 5760 cases, highlights a need to consider a smaller N. New data sets, amounting to a half and a quarter of the original baseline scenario case-load (2880 cases from 60 areas allocated to 120 interviewers, and 1440 cases from 30 areas allocated to 60 interviewers), are also generated. For the baseline scenario there are twice as many interviewers as there are areas: N_I = 2N_A. Another alternative considered is to have an equal number of interviewers and areas, N_I = N_A, i.e. 120 interviewers for 120 areas for N = 5760. For this data structure only six interviewer-area allocation schemes are considered, varying from the most geographically restrictive case, where one interviewer works only in one area, to the sparsest, where each interviewer works in six areas. In this case, variations in the amount of overlap in the groups of interviewers allocated to each area are not attempted, and the allocation schemes always allow the same group of interviewers to work together in neighbouring areas. These allocation schemes, shown in Table 3 and denoted case a, where a represents the number of areas that each interviewer works in, are therefore comparable with the case 2(a) allocation scheme that was outlined above.
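The circular allocation logic described above can be sketched as follows for cases 1 and 2(a); the function names are illustrative and not from the paper.

```python
def case1_allocation(n_areas, per_area=2):
    """Case 1: each interviewer works in exactly one area (no
    interpenetration); `per_area` interviewers are assigned per area."""
    return [list(range(a * per_area, (a + 1) * per_area))
            for a in range(n_areas)]

def case2a_allocation(n_areas):
    """Case 2(a): each interviewer works in two neighbouring areas and the
    same set of four interviewers is shared by both areas of each pair.
    Areas are treated as sequential numbers in a circle, so pairing
    consecutive areas keeps every interviewer's two areas adjacent."""
    return [list(range((a // 2) * 4, (a // 2) * 4 + 4))
            for a in range(n_areas)]
```

With 120 areas, both schemes use 240 interviewers; in case 2(a) each interviewer appears in exactly two areas and each area contains four interviewers, matching the baseline N_I = 2N_A structure.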

Evaluation: properties of the estimators and test statistics
The models are assessed in terms of the following properties: the correlation of the two variance estimators, the percentage relative bias, the mean-squared error, the confidence interval coverage and the power of tests. The covariance between the area and interviewer variance estimators is a quality measure in itself. For easier interpretation the correlation ρ for each data set is calculated by using the formula

ρ = cov(σ²_u, σ²_v) / {sd(σ²_u) sd(σ²_v)},    (6)

where the covariance and standard deviations are computed across the Markov chain Monte Carlo draws of the two variance parameters for that data set. 'Good' estimators are expected to show no substantial correlation. High negative correlation values indicate problems with the identifiability of the two variance parameter estimates. In such cases the model may correctly estimate the total higher-level variance, which is the sum of the interviewer and area variances, but incorrectly apportion the variance between the two higher-level classifications, producing biased estimates of the individual random parameters. One estimate would be overestimated and the other underestimated, resulting in a negative correlation. Negative correlations of magnitude 0.1 or greater will be considered problematic. Browne et al. (2001) referred to this problem as the collinearity of random terms and identified 'poor mixing properties and high negative cross-chain correlations' (page 14) as good identifiers of it.
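As an illustration (not the authors' code), the per-data-set correlation between the two variance-parameter chains can be computed with a helper such as the following:

```python
import numpy as np

def chain_correlation(chain_u, chain_v):
    """Pearson correlation between the MCMC draws of the interviewer
    variance (chain_u) and the area variance (chain_v) for one simulated
    data set; a strong negative value flags identifiability problems."""
    return float(np.corrcoef(np.asarray(chain_u, float),
                             np.asarray(chain_v, float))[0, 1])
```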
The percentage relative bias of each parameter is calculated to determine the accuracy of the parameter estimators. Only estimated percentage relative bias values above 3% are considered substantial. Standard error accuracy is assessed by using the coverage method (Maas and Hox, 2005), where coverage of the true parameter value within the 95% Wald confidence interval of the parameter estimate is recorded separately for each simulated data set. The coverage rate is recorded for all simulation scenarios and compared with the nominal rate of 95%. The null hypothesis, specifying the true parameter value to be 0, is tested for both variance parameters of each simulated data set by using the Wald test. The power of a test indicates the probability that the null hypothesis is correctly rejected. Maas and Hox (2005) explained that testing the significance of variance parameters by using the asymptotic standard error is not ideal. Such a test is based on normality assumptions, and the null hypothesis, which specifies the random parameter to be equal to 0, lies on the boundary of the permissible parameter space, since variances can only be positive; standard likelihood theory no longer holds at this boundary. However, the practice is widely used, which justifies its use in this simulation study. In calculating the power for the variance parameters the p-values are halved, since variances cannot be negative and the alternative hypothesis is therefore one sided (Snijders and Bosker, 1999).
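A minimal sketch of these evaluation measures, assuming arrays of point estimates and Wald standard errors collected across the simulation replicates (the function and argument names are illustrative):

```python
from statistics import NormalDist
import numpy as np

def evaluate(estimates, ses, true_value, alpha=0.05):
    """Percentage relative bias, 95% Wald coverage rate and power of the
    one-sided Wald test of H0: parameter = 0 (two-sided p-value halved,
    since a variance cannot be negative)."""
    est = np.asarray(estimates, float)
    se = np.asarray(ses, float)
    z975 = NormalDist().inv_cdf(0.975)
    rel_bias = 100.0 * (est.mean() - true_value) / true_value
    covered = (est - z975 * se <= true_value) & (true_value <= est + z975 * se)
    p_two = 2.0 * np.array([1.0 - NormalDist().cdf(abs(z)) for z in est / se])
    power = float((p_two / 2.0 < alpha).mean())
    return rel_bias, float(covered.mean()), power
```

The mean-squared error would follow analogously as the mean of (estimate − true value)² across replicates.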

Diagnostics
Results may be affected by convergence issues or inappropriate starting values. To identify the appropriate length of burn-in and to avoid undue influence from the starting values different burn-in lengths are explored for each scenario for a sample of the simulated data sets (Gelman et al., 2004). Similarly, the Brooks-Draper and Raftery-Lewis diagnostics (Browne, 2012) are checked to identify the length of chain that is required for accurate point estimates and 95% credible intervals. As a further check, these diagnostics, obtained and saved for each model run, are inspected for convergence. For each scenario a visual inspection of some of the trace plots of the parameters is also carried out.

Results
To remind the reader of what was outlined in Section 3, the baseline scenario design has the following properties: 120 areas (48 cases per area) allocated to 240 interviewers (24 cases per interviewer), totalling 5760 cases, with σ²_v = 0.3, σ²_u = 0.3 and π = 0.8. Generally one or two factors from σ²_v, σ²_u, π, N and the ratio of interviewers to areas (dependent on N_I and N_A) are changed for every new scenario. For every specific set of factor values different interviewer allocation schemes are specified, giving rise to more scenarios. The effect (or lack of effect) of these factors on the properties of the estimators is reviewed. General patterns are documented and any possible interactions between factors highlighted.
The properties for the overall intercept β_0 showed relatively stable results across different factor values. Under all simulation scenarios the test for β_0 obtains a power of 1. Accurate estimates of β_0 are obtained even for small N and very geographically restrictive interviewer allocation schemes. The Wald coverage rates are close to the 95% nominal rates across all scenarios. Consequently, the analysis of the effect of various factor changes on the above-mentioned properties will be restricted to the random parameters. These patterns for each property across factors will now be summarized.

Power of test
For the baseline scenario design the power of the Wald test at the 5% level of significance is equal to the optimal value of 1 for all interviewer case allocation possibilities for both random parameters, except for the test of σ²_v for the least sparse interviewer allocation (case 1), which yields a power of 0.91 (Table 5, second and third columns).
Interviewer dispersion is the factor which shows the greatest influence on power. For scenarios with one interviewer per area power is observed to decrease to 0 in certain cases (Table 5, first row), as expected given the exact collinearity between areas and interviewers, whereas the lowest power observed for schemes with two interviewers per area is 0.67 (Table 5, second row). There is a threshold, which varies for different factor value combinations, beyond which further dispersion does not yield power gains. In contrast, reduced interviewer overlap for a constant number of areas per interviewer does not improve the power (Table 6). Here overlap refers to the extent to which the group of interviewers working in neighbouring areas is the same, such that case 2(a) has greater overlap than case 2(c). (Complete results for Table 6 can be found in the on-line appendix Table A1.) Tables 5 and 6 show that for scenarios with smaller N, keeping all other factors constant, lower power is obtained for the allocation schemes with the least interviewer dispersion (the number of areas that an interviewer works in). Therefore, sparser interviewer allocation schemes are required to obtain similarly high levels of power for scenarios with a smaller N. The effect of sample size reduction on power is greater for N_I = N_A scenarios (Table 5, second-seventh columns) than for N_I = 2N_A scenarios (Table 5, eighth-12th columns).
When only the overall probability varies, with the other factors kept constant at their baseline values, more extreme overall probabilities result in lower power for the random parameters σ²_v and σ²_u in case 1 scenarios (on-line appendix Table A2). Extreme overall probabilities seem to have a greater effect on the power of tests for random-effects parameters which have a smaller number of higher-level units in the sample, i.e. the area random parameter σ²_v compared with the interviewer random parameter σ²_u (on-line appendix Table A2). The number of higher-level units as well as interviewer dispersion moderate the effect of a lower variance on the power of the test for the random parameter, such that the only difference in power across different variances is observed for case 1 for the area variance parameter (on-line appendix Table A3). Increasing the area variance while holding the interviewer variance constant results in higher power. Interestingly, for case 1, for a specific value of the area variance, higher power for the test of the area parameter is obtained when the interviewer variance is smaller. For N_I = 2N_A scenarios, where substantial differences can be noted in the power of the tests for the random parameters, the power for the area parameter σ²_v is consistently lower than that for the interviewer parameter σ²_u (Table 5). No difference is observed for N_I = N_A scenarios. These results indicate that the number of higher-level units moderates the effect of a lower intracluster correlation (ICC) on the power of the tests for the random parameters.
The ratio of interviewers to areas also influences the power for the random parameters. Scenarios with N_I = N_A require more interviewer dispersion than equivalent N_I = 2N_A-scenarios to obtain the same power for the random parameters (Table 5).

Correlation between random-parameter estimators
Interviewer dispersion strongly influences the correlation ρ between the two variance estimators. High negative correlations (exceeding 0.4 in absolute value, up to a maximum of 0.91) are obtained for all scenarios in which interviewers work in only one area (Tables 7 and 8, first row). This correlation is reduced to less than 0.2 in absolute value once interviewers work in two areas (Tables 7 and 8, second row).
No effect of sample size on ρ is observed for allocation schemes which allocate interviewers to at least two areas (Table 7). For case 1 (Table 7, first row) the correlation varies across N, mainly for N_I = N_A-scenarios. Therefore, the effect of N on ρ is moderated by the number of higher-level units, or an unequal ratio of the two types of higher-level units, as well as by the interviewer dispersion. Scenarios with equal numbers of areas and interviewers obtain higher negative correlations than scenarios with twice as many interviewers as areas (Table 7). This difference may be explained in terms of improved identifiability of the variance decomposition for scenarios with higher numbers of clusters, or alternatively an unequal number of clusters for the two classifications.
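The reported ρ is simply the correlation, across simulation replications, between the paired estimates of the two variance parameters; when one estimator systematically compensates for the other, the correlation is negative. A minimal sketch with invented replicate values:

```python
def pearson_r(xs, ys):
    # Plain Pearson correlation between paired replicate estimates.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Replicate estimates that trade off against one another give a
# strong negative correlation (here close to -1.0).
print(pearson_r([0.25, 0.15, 0.22, 0.18], [0.15, 0.25, 0.18, 0.22]))
```

In the confounded case 1 design the model can shift variance freely between the two classifications, which is exactly the trade-off this toy example mimics.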
The negative correlation increases, monotonically but not linearly, with increasing overall probabilities (Table 8). For allocation schemes with at least three areas per interviewer the effect of overall probability on ρ is marginally lower.
Higher area variance values result in lower negative correlations for the more restrictive interviewer allocation schemes, whereas no trend is identified when varying the interviewer variance (on-line appendix Table A4). These results suggest that the number of higher-level units that are associated with a variance parameter moderates the effect of the variance on ρ.
Lower negative correlation is obtained for the two areas per interviewer allocation schemes which have less overlap (Table 8, second and third rows). There seems to be no effect of interviewer overlap for more dispersed interviewer allocation schemes. This result indicates that the effect of interviewer overlap is moderated by the interviewer dispersion, i.e. the number of areas that an interviewer works in.

Percentage relative bias of parameter estimators
In most scenarios with N = 5760, the percentage relative biases for the variance parameter estimators are around 1-3% once interviewers have been allocated work in at least two areas (Table 9). The bias is much higher for interviewer allocation schemes which restrict the interviewer to working in one area (case 1). The biases for cases 2-6 fluctuate within the range specified above, failing to show any systematic reduction with further dispersion and less interviewer overlap. For interviewer case allocation schemes in which interviewers work in at least two areas, the bias for the area random parameter σ²_v is almost always greater than that for the interviewer random parameter σ²_u (Table 9, second-seventh columns, second-sixth rows). This again confirms the importance of group size for the accuracy of parameter estimators.
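The percentage relative bias reported here is the usual Monte Carlo quantity: the mean of the replicate estimates relative to the generating value. A minimal sketch (the replicate values are invented for illustration):

```python
def pct_relative_bias(estimates, true_value):
    # 100 * (mean of replicate estimates - generating value) / generating value
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value

# Four hypothetical replicate estimates of a variance generated as 0.2.
print(pct_relative_bias([0.205, 0.195, 0.21, 0.202], 0.2))  # 1.5 (%)
```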
As expected, greater biases for the σ²_v- and σ²_u-estimators are observed for smaller N, with the scenario including 1440 cases with N_I = N_A obtaining biases between 5% and 13% for all allocation schemes (Table 9, 10th and 13th columns). Scenarios with N_I = N_A (Table 9, eighth-13th columns) generally obtain higher biases for both variance parameter estimators than N_I = 2N_A-scenarios (Table 9, second-seventh columns). This trend is observable for the interviewer parameter σ²_u-estimator and is what would be expected given the greater number of interviewers in the N_I = 2N_A-scenarios compared with the N_I = N_A-scenarios. In contrast, for the area parameter σ²_v-estimator, where N_A = 120 in both the N_I = N_A- and N_I = 2N_A-scenarios, this pattern is less consistent for the 5760 and 2880 sample size scenarios. However, with a total sample size of 1440 the N_I = N_A-scenario yields consistently higher biases than the N_I = 2N_A-scenarios. These results may support the hypothesis that having more interviewers estimates the interviewer variance better and hence also the area variance. No clear trend in the change in bias by interviewer overlap, interviewer dispersion beyond two areas per interviewer, overall probability or variances is observed (on-line appendix Tables A5 and A6).

Wald confidence interval coverage
The Wald confidence interval coverage rates are close to the 95% nominal rate (between 94% and 96%) in most scenarios. However, there are some cases of undercoverage (the lowest observed rate is 87%) as well as a very few cases of overcoverage (the highest observed rate is 100%) for scenarios where each interviewer works in one area only. Slightly lower coverage rates are observed for smaller N in most scenarios for both σ²_v and σ²_u (Table 10). (Complete results for Table 10 can be found in the on-line appendix Table A7.) Only the scenarios with the smallest sample size of N = 1440 consistently obtain inaccurate coverage rates across all interviewer case allocation schemes. However, these rates do not fall below 89% once each interviewer has been allocated work in at least two areas. Coverage rates closer to the 95% nominal rate for the σ²_u-parameter are noticeable for the N_I = 2N_A-scenarios compared with the N_I = N_A-scenarios for N = 5760 (Tables 10 and 11). This improvement in the confidence interval coverage rate with an increase from 120 interviewers to 240 interviewers no longer occurs for smaller N.
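Coverage here is computed in the standard way: the proportion of replications whose Wald interval, estimate ± 1.96 standard errors, contains the generating value. A minimal sketch with invented replicate values:

```python
def wald_coverage(estimates, ses, true_value, z=1.96):
    # Fraction of replications whose Wald interval contains the true value.
    hits = sum(1 for est, se in zip(estimates, ses)
               if est - z * se <= true_value <= est + z * se)
    return hits / len(estimates)

# Of these three hypothetical intervals, only the first covers 0.2.
print(wald_coverage([0.2, 0.3, 0.05], [0.05, 0.04, 0.05], 0.2))
```

In practice a few thousand replications per scenario are needed for the Monte Carlo error in such a coverage rate to be small relative to the departures from 95% discussed above.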
Some factors that were considered in this study do not seem to influence coverage rates. There does not seem to be a consistent pattern in the coverage rates by the overall probability or by the higher-level variances. Nor do the results show any evidence that the extent of interviewer overlap influences coverage rates.

Discussion of results
As expected, the results show worse estimator properties for smaller N. It is important to consider that in this study it is not possible to distinguish clearly between the effects of decreases in N and decreases in N_A and N_I, since halving N also halves the number of higher-level units. Bias has been found to increase with decreases in N, particularly when halving N from 2880 to 1440 for the N_I = N_A-scenarios. This is similar to the result obtained by Paccagnella (2011), who showed that the improvements in the estimators' accuracy with sample expansions decrease as N increases. Similarly to Moineddin et al. (2007), there is some evidence in this study of lower coverage rates for smaller N, noticeable for the 1440 sample size scenario compared with the 5760 and 2880 sample size scenarios. Power also decreases for smaller total sample sizes, though this effect is moderated by interviewer dispersion. The opposite trend can be observed for the correlation between the two random-parameter estimators, with the one area per interviewer allocation scheme showing a decrease in the negative correlation with decreasing N. This trend is more pronounced in the N_I = N_A- than in the N_I = 2N_A-scenario. However, it is negligible for both these scenarios once interviewers work in at least two areas each.
The above-mentioned results on the relationship between N and the various properties show that reductions in N can be moderated to some extent by interviewer dispersion. However, small N-scenarios (1440 cases) are to be avoided as, even with sparse interviewer allocation schemes, they do not achieve acceptable levels of accuracy, precision and power. In contrast, large and medium-sized samples, including N_I = 2N_A-scenarios, obtain good estimates once interviewers work in at least three areas. The comparison of the N_I = 2N_A- with the N_I = N_A-scenarios indicates that, overall, a higher number of clusters as opposed to a higher cluster size for a constant N is expected to yield better estimates. In this paper N does not increase as the number of groups is increased; instead, the number of groups is altered for a set N. Lower negative correlation between the two higher-level random effects, higher power for the Wald test for σ²_u, and lower standard errors and lower percentage relative bias for the σ²_u-estimator, are observed for the N_I = 2N_A- compared with the N_I = N_A-scenarios for some of the least sparse interviewer allocation schemes, and especially for smaller N. The improvement in the accuracy and precision of the σ²_v-estimator for the smallest sample size scenario and the higher power for the Wald test for σ²_v may indicate that, besides the effect of the number of clusters (which for the area classification remains unchanged), the ratio of higher classification units may also affect the quality of estimates, with a ratio unequal to 1 performing better. This result suggests that a larger N_I should be working within a set N_A for best quality estimates. These results are consistent with previous simulation studies (Maas and Hox, 2005; Paccagnella, 2011; Mok, 1995) for two-level hierarchical models which emphasize the importance of a higher number of clusters, as opposed to a larger cluster size, for the quality of estimates from multilevel models.
In this study, lower power of the Wald test for the random parameters and higher correlation between the two random-parameter estimators are found for more extreme overall probabilities under some restrictive interviewer case allocation schemes. Both this study and Moineddin et al. (2007) suggest that estimates of lower quality are obtained for extreme values, with Moineddin et al. (2007) investigating the lower end of the spectrum and this study investigating the higher end. The effect of the overall probability on the negative correlation between the two random-parameter estimators is observed only up to interviewer allocation 3(a), whereas the effect on the power of the Wald test is restricted to just the most restrictive interviewer case allocation, case 1. Therefore, some of the effects of the overall probability on the quality of estimates can be avoided during the survey administration by assigning work to interviewers in at least three areas.
In this study the size and direction of the effect of the ICC on the quality of the estimates seem to vary across the different properties. Higher ICC-values seem to decrease the negative ρ, although this is no longer noticeable for higher-level effects with a large number of clusters in the sample. In fact, lower negative ρ are observed for higher area variances σ²_v up to interviewer allocation case 3(a), but no consistent change is observed for higher interviewer variances σ²_u in scenarios with double the number of interviewers to areas. Similarly, the ICC is found to have a positive relationship with the power of the Wald test for the most restrictive interviewer case allocation, case 1, but again for the other higher-level classification with twice the number of clusters this effect is not observed. In contrast, precision seems to decrease for higher variances. Similarly to Maas and Hox (2005) and Paccagnella (2011), in this study no clear pattern for the change in the percentage relative mean bias of the variance parameter estimators by ICC is observed. Contrary to the results reported by Moineddin et al. (2007), in this study no evidence of an effect of the ICC on the confidence interval coverage rates has been found. Similarly to the effect of the overall probability on the quality of estimates, these results indicate that, once each interviewer has been allocated cases in two, and sometimes three, different areas, small ICC-values will generally not be detrimental to the quality of the estimates. It is important to note that very small variances are not investigated in this study.
Interviewer dispersion, which refers to the number of areas that each interviewer works in, improves the quality of estimates only up to a point. The power of the Wald test at the 5% level of significance for the baseline scenario design is close to the optimal value of 1 for all interviewer case allocation schemes. For scenarios with smaller N, all other factors held constant, sparser interviewer allocation schemes are required to obtain high power. Improvements in power are observed when increasing the number of areas per interviewer from 1 to 2 for N = 2880 and N = 1440, and from 2 to 3 for N = 1440. Further dispersion yields only very small gains. The correlation between the two parameter estimators is reduced to the chosen threshold of −0.1 once interviewers have been allocated to two areas for N_I = 2N_A-scenarios, and three areas for N_I = N_A-scenarios. Decreases in the percentage relative bias are substantial when comparing the case 2 with the case 1 allocation scheme. However, no systematic trend in bias reduction is observed for cases 3-6. Confidence interval coverage rates show problems of overcoverage and undercoverage for different scenarios under the case 1 allocation scheme but are close to the 95% nominal rate for all other allocation schemes.
No consistent relationship of bias, confidence interval coverage rates or power of the Wald test with the extent of interviewer overlap is found. The only effect of interviewer overlap is restricted to the ρ-values for the two areas per interviewer allocation schemes, with less overlap resulting in lower negative ρ. Consequently, for the scenarios considered in this study, once all interviewers work in at least three areas there is no benefit in aiming for less interviewer overlap. This result indicates that complicating interviewer case assignments by sending interviewers further from their area of residence, in an attempt to avoid having the same interviewers working in the same neighbouring areas, is not necessary to obtain quality estimates.

Conclusions and implications for survey design
The simulations in this paper offer new insight into the performance of advanced multilevel models under realistic survey design conditions. This paper is the first work to investigate these properties for cross-classified models under different survey design conditions. It has identified trends in the properties of the estimators and test statistics across a range of values for the simulation factors considered.
This paper indicates that, as expected, purely hierarchical data are subject to substantial biases, high negative correlations between the two random-parameter estimators, undercoverage and overcoverage of the Wald confidence interval and low power of the Wald test. Interpenetration of the higher-level groups at the design stage is required to allow the two higher-level effects to be disentangled when estimating them by using multilevel cross-classified models. For the scenarios considered in this paper, limited overlap of the higher-level groups (around three areas per interviewer for medium or large sample sizes) has been shown to provide sufficient interpenetration for good properties. Further dispersion yields only very small or negligible improvements in the properties. Overlap of the higher-level groups also acts as a moderating factor on the effect of the other simulation factors (sample size, the ratio of interviewers to areas, the overall probability and the variance values) on the properties of the estimators and test statistics. Reductions in total sample size can be moderated to some extent by interviewer dispersion. However, small N (1440 cases) is to be avoided as, even with sparse interviewer allocation schemes, such scenarios do not achieve acceptable levels of accuracy, precision and power. Importantly, the results also show that, once interviewers work in at least three areas, complicating interviewer case assignments by sending interviewers further from their area of residence, in an attempt to avoid having the same interviewers working in the same neighbouring areas, does not improve the quality of estimates. Consequently, study designs should focus on allowing some interpenetration between the two higher-level classifications, and can disregard the extent of overlap of higher-level units from the same classification, thereby avoiding increased survey costs and logistical complications.
Moreover, these results shed a positive light on the validity of findings presented in existing empirical studies analysing interviewer and area effects on non-response through cross-classified multilevel analysis (Campanelli and O'Muircheartaigh, 1999; Pickery et al., 2001; Durrant et al., 2010; Vassallo et al., 2015). This study has shown that limited interpenetration is sufficient to disentangle interviewer from area effects. The results also suggest that, for a fixed total sample size, a higher number of clusters as opposed to a higher cluster size yields better estimates. Moreover, the ratio of higher classification units may also affect the quality of estimates, with a ratio unequal to 1 performing better.
It is acknowledged that the results from this paper are restricted to the factor values chosen and the scenarios considered; they cannot necessarily be extrapolated to very different survey design conditions. Further research investigating different simulation factor values and data structures should be carried out to corroborate and extend the existing evidence on the performance of these models. One particular area of further research should be an examination of these properties for very small higher-level variances. Another is the investigation of scenarios with residual correlation between areas.
The paper considers the properties of variance estimators only. The data are generated from models including an overall intercept and random effects; no explanatory variables are considered. The simulation studies reviewed earlier indicate that the worst estimator and test statistic properties are observed for the variance estimators. Consequently, the focus on the random effects is justified, as these parameters are the most susceptible to changes in the simulation factors. Moreover, scenarios achieving acceptable properties for the variance parameters can be assumed also to provide acceptable properties for fixed effect parameters. In future work the inclusion of fixed effects, especially cross-level interaction effects and contextual effects, should be considered.
This work created the procedure and the R and Stata code that can be used independently of this research project to investigate the performance of multilevel cross-classified logistic models for existing data structures, or to inform the design of future studies with similar designs. A future project may focus on creating an on-line platform, similar to the MLPowSim tool (Browne and Golalizadeh, 2009), for other users to specify their data structure and run the simulation for their own specific application.
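The data-generating step of such a simulation can be sketched as follows (in Python rather than the paper's R and Stata code; the allocation rule, which assigns each interviewer a run of consecutive areas, and all parameter values are illustrative assumptions, not the paper's actual design):

```python
import math
import random

def simulate_response(n_areas, areas_per_int, cases_per_cell,
                      beta0, sigma2_u, sigma2_v, seed=0):
    """Generate 0/1 response outcomes from a cross-classified logistic
    model, logit(p) = beta0 + u_interviewer + v_area, in which each
    interviewer works in `areas_per_int` consecutive areas (a stylised
    allocation scheme, not the paper's actual cases)."""
    rng = random.Random(seed)
    n_ints = n_areas  # an N_I = N_A configuration for simplicity
    u = [rng.gauss(0, math.sqrt(sigma2_u)) for _ in range(n_ints)]
    v = [rng.gauss(0, math.sqrt(sigma2_v)) for _ in range(n_areas)]
    data = []
    for j in range(n_ints):
        for step in range(areas_per_int):
            k = (j + step) % n_areas  # wrap around the list of areas
            for _ in range(cases_per_cell):
                p = 1 / (1 + math.exp(-(beta0 + u[j] + v[k])))
                data.append((j, k, 1 if rng.random() < p else 0))
    return data

# 12 areas, 3 areas per interviewer, 4 cases per interviewer-area cell.
data = simulate_response(12, 3, 4, beta0=1.5, sigma2_u=0.2, sigma2_v=0.2)
```

Setting areas_per_int = 1 reproduces the fully nested case 1 structure, while larger values correspond to greater interviewer dispersion; the generated (interviewer, area, outcome) triples would then be passed to a cross-classified model fitter and the estimates collected over replications.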