Handling Missing Data in Cross-Classified Multilevel Analyses: An Evaluation of Different Multiple Imputation Approaches

Multiple imputation (MI) is a popular method for handling missing data. In education research, it can be challenging to use MI because the data often have a clustered structure that needs to be accommodated during MI. Although much research has considered applications of MI in hierarchical data, little is known about its use in cross-classified data, in which observations are clustered in multiple higher-level units simultaneously (e.g., schools and neighborhoods, transitions from primary to secondary schools). In this article, we consider several approaches to MI for cross-classified data (CC-MI), including a novel fully conditional specification approach, a joint modeling approach, and other approaches that are based on single- and two-level MI. In this context, we clarify the conditions that CC-MI methods need to fulfill to provide a suitable treatment of missing data, and we compare the approaches both from a theoretical perspective and in a simulation study. Finally, we illustrate the use of CC-MI in real data and discuss the implications of our findings for research practice.

observed data and an imputation model (Rubin, 1987). A key requirement of MI is that the imputation model correctly takes into account the data structure and the relationships between the observed variables. This can be particularly challenging when the data have a clustered structure, in which observations are organized within higher-level units (e.g., students in schools, employees in teams, repeated measures in individuals). Although a number of studies have considered applications of MI in clustered data, much of this research has been concerned with hierarchical data, such as two- and three-level data (Enders et al., 2016; Goldstein et al., 2009; Grund et al., 2018b; Lüdtke et al., 2017; Schafer & Yucel, 2002; Wijesuriya et al., 2020). By contrast, the use of MI with nonhierarchical data structures, such as cross-classified or multiple-membership structures, is still poorly understood (for an overview, see Rasbash & Browne, 2008).
The purpose of this article is to investigate the effectiveness of MI for the treatment of missing values in cross-classified data, wherein observations can belong to multiple clusters that do not form a clear hierarchy (Goldstein, 2011; Raudenbush & Bryk, 2002). Cross-classified data are common in many areas of research, for example, in cross-sectional data when individuals are organized in multiple higher-level units (e.g., students in schools and neighborhoods; employees in teams and fields of expertise; see also Claus et al., 2020; Fielding & Goldstein, 2006) or in longitudinal data when cluster membership changes over time (e.g., students in primary and secondary schools; see also Cafri et al., 2015). The treatment of missing data in cross-classified data can be extremely challenging because the variables can be observed at different levels, and the cross-classified structure implies a complex pattern of dependency between the observations that can no longer be captured by hierarchical models.
In writing this article, we have three major goals. First, we aim to clarify the requirements that imputation approaches need to fulfill in order to provide a suitable treatment of missing values in the analysis of cross-classified data. In this context, we focus on analyses with random intercepts and linear effects, and we discuss how the imputation approaches can be extended to address additional types of analyses. Second, we aim to compare the statistical properties of different MI approaches for cross-classified data (CC-MI) from both theoretical and practical perspectives and by using the results of a simulation study. To this end, we (a) introduce a novel approach to CC-MI that is based on the fully conditional specification (FCS) approach to MI and (b) outline a Bayesian joint modeling (JM) approach for the treatment of incomplete cross-classified data. Third, we illustrate the application of CC-MI in a worked example with real data from education research, provide recommendations, and outline limitations and extensions of the approaches that we considered.
This article is organized as follows. In the first section, we provide a brief introduction to the structure and analysis of cross-classified data. In doing so, we try to outline the most important structural features of cross-classified data and explain how they can be analyzed with cross-classified random-effects models (CCRMs). Next, we present the JM and FCS approaches to CC-MI and explain how these methods accommodate the structural properties of cross-classified data. In this context, we also consider ad hoc approaches that extend conventional methods for single- and two-level MI to better accommodate cross-classified data and that can be implemented in a wide variety of statistical software. Then, we present the results of a simulation study, in which we evaluated the statistical properties of these methods. Finally, we demonstrate the application of the methods in an example with computer code and real data, and we discuss the implications of our findings for the treatment of incomplete cross-classified data.

Cross-Classified Data
Cross-classified data are characterized by a clustered structure, in which observations belong to multiple clusters simultaneously. For example, consider the hypothetical scenario in Figure 1, where students are clustered within the units of two random factors: schools (A) and neighborhoods (B). In such a case, students who attend the same school or live in the same neighborhood often tend to be more similar to each other than to students in different schools or neighborhoods because they are exposed to similar contextual influences. Both factors represent higher-level units, and we refer to these levels as Levels A and B. However, in contrast to the clustering that occurs in hierarchical data, the two factors are crossed and not nested within one another. In other words, the students are cross-classified by neighborhoods and schools. This is reflected by the fact that the students at any particular school sometimes live in different neighborhoods, and the students in any particular neighborhood sometimes attend different schools. The combined membership of each student in one school and one neighborhood forms a number of school-neighborhood pairs, in which a certain number of students are nested. We refer to this intermediate level as
Level AB. Conceptually, the school-neighborhood pairs may correspond to more tightly knit communities or peer groups that share additional contextual influences that are not shared by other students who attend the same school (but live in different neighborhoods) or live in the same neighborhood (but attend different schools).
The school-neighborhood assignment shown in Figure 1 is an example of partial cross-classification, in which every school is crossed with only a subset of the neighborhoods and vice versa. The partially cross-classified structure is reflected by the fact that each neighborhood sends students to only some of the schools, and each school recruits students from only some of the neighborhoods. For the example above, the assignment of schools and neighborhoods into school-neighborhood pairs is illustrated in more detail in Figure 2. By contrast, a full cross-classification would occur if every school received students from all the neighborhoods or, equivalently, if every neighborhood sent students to all the schools.
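The distinction between full and partial cross-classification can also be stated computationally: a classification is fully crossed when every school-neighborhood combination occurs at least once, that is, when the J × K cross-table of student counts has no empty cells. The following sketch illustrates this check with made-up memberships (the data are not taken from the article).

```python
import numpy as np

# Toy memberships (illustrative only): the school and neighborhood of
# each of five students.
school = np.array([0, 0, 1, 1, 1])   # factor A
neigh  = np.array([0, 0, 0, 1, 1])   # factor B

J, K = school.max() + 1, neigh.max() + 1
counts = np.zeros((J, K), dtype=int)
np.add.at(counts, (school, neigh), 1)  # students per school-neighborhood pair

# Full cross-classification requires every cell to be nonempty.
fully_crossed = bool((counts > 0).all())  # False here: pair (0, 1) is empty
```

With these toy memberships, the pair of school 0 and neighborhood 1 receives no students, so the classification is only partially crossed.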
Examples of cross-classified data can be found in many areas of research. For example, in education research, cross-classification can occur when schools are crossed with other organizational units, such as neighborhoods or families (e.g., Dundas et al., 2014; Dunn et al., 2015; Garner & Raudenbush, 1991). In longitudinal data, cross-classification occurs when students transition from one type of school to another (e.g., primary and secondary school; Goldstein & Sammons, 1997; Paterson, 1991) or switch classes or teachers over time (e.g., Gregory & Huang, 2013; Heck, 2009; Kyriakides & Creemers, 2008). Finally, cross-classified data can also occur in other areas of research, such as in clinical and organizational research (Barker et al., 2020; Claus et al., 2020), in experimental research (Baayen et al., 2008), or in the context of generalizability theory (Cronbach et al., 1972; Shavelson & Webb, 2000).

FIGURE 2. Example (continued) for cross-classified data with students (Level 1) clustered in schools (Level A) and neighborhoods (Level B). The two panels represent the ties between schools and neighborhoods (a) and the number of students for each school-neighborhood pair (b).

Cross-Classified Random-Effects Models
One of the most popular methods for analyzing cross-classified data is the CCRM. Suppose that a researcher is interested in the relationship between an explanatory variable x and an outcome variable y in a sample of students (Level 1) who are clustered within schools (Level A) and neighborhoods (Level B). For example, y may represent students' academic achievement, whereas x may represent their socioeconomic status (SES).
Intercept-only model. A common first step in the analysis of cross-classified data is to estimate the components of variance in y that can be attributed to differences between schools, neighborhoods, and school-neighborhood pairs, for example, by using the following intercept-only model (Raudenbush & Bryk, 2002). For student i (i = 1, ..., n_jk) in school j (j = 1, ..., J) and neighborhood k (k = 1, ..., K),

y_ijk = β_0 + u_A,j + u_B,k + u_AB,jk + e_ijk ,    (1)

where β_0 is the overall intercept, u_A,j and u_B,k are the random intercepts of the schools and neighborhoods, respectively, u_AB,jk are the random intercepts of the school-neighborhood pairs, and e_ijk are the student-specific residuals. The random effects u_A,j, u_B,k, and u_AB,jk and the residuals e_ijk are assumed to follow independent normal distributions with means of zero and variances denoted by τ²_A, τ²_B, τ²_AB, and σ², respectively. The main purpose of this model is to distinguish the components of variance in y that pertain to differences between schools (A), neighborhoods (B), and school-neighborhood pairs (AB). These can subsequently be used to compute the intraclass correlation (ICC) at different levels (see Raudenbush & Bryk, 2002). Conceptually, the random effects associated with A and B (u_A,j and u_B,k) represent the mean differences that exist between the members of different schools or neighborhoods and that are shared among the students who attend the same school (j) or live in the same neighborhood (k). The random effect associated with the school-neighborhood pair (u_AB,jk) represents the differences between the mean values of students who belong to a particular combination of school and neighborhood (jk) above and beyond the differences that they share with other students who attend the same school (but live in different neighborhoods) or live in the same neighborhood (but attend different schools).
The random effects of the cluster membership are sometimes also referred to as "main" (u_A,j and u_B,k) and "interaction" (u_AB,jk) effects (e.g., Beretvas, 2011) because they reflect mean differences between groups of students similar to the main and interaction effects in a two-way between-subjects analysis of variance (see also Maxwell et al., 2018; Raudenbush & Bryk, 2002).
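The variance decomposition in Equation 1 can be made concrete with a small Monte Carlo sketch. The variance components below are hypothetical values chosen for illustration, not estimates from the article; the ICCs follow directly from the components (e.g., the ICC for two students in the same school-neighborhood pair is the sum of all three between-cluster components over the total variance).

```python
import numpy as np

# Monte Carlo sketch of the intercept-only CCRM with hypothetical
# variance components (values are illustrative, not the article's).
rng = np.random.default_rng(1)

tau2_A, tau2_B, tau2_AB, sigma2 = 0.10, 0.10, 0.05, 0.75
J, K, n = 50, 50, 4  # schools, neighborhoods, students per pair (fully crossed)

u_A  = rng.normal(0.0, np.sqrt(tau2_A), size=J)        # school intercepts
u_B  = rng.normal(0.0, np.sqrt(tau2_B), size=K)        # neighborhood intercepts
u_AB = rng.normal(0.0, np.sqrt(tau2_AB), size=(J, K))  # pair intercepts
e    = rng.normal(0.0, np.sqrt(sigma2), size=(J, K, n))

# y_ijk = beta_0 + u_A,j + u_B,k + u_AB,jk + e_ijk (with beta_0 = 0 here)
y = u_A[:, None, None] + u_B[None, :, None] + u_AB[:, :, None] + e

# Intraclass correlations implied by the variance components
total  = tau2_A + tau2_B + tau2_AB + sigma2
icc_A  = tau2_A / total                       # students sharing only a school
icc_B  = tau2_B / total                       # students sharing only a neighborhood
icc_AB = (tau2_A + tau2_B + tau2_AB) / total  # students sharing a pair
```

The empirical variance of the simulated y approaches the sum of the four components as the numbers of schools, neighborhoods, and students grow.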

Cross-Classified Multiple Imputation
Random-intercept model with explanatory variables. The model above can be extended to address more interesting research questions by including explanatory variables that can be measured at any level of the sample (Raudenbush & Bryk, 2002). In addition, the model can include the cluster means of explanatory variables at Level 1, which allows the effects of these variables to take on different values at different levels. For example, with three explanatory variables x at Level 1, z at Level A, and w at Level B, the model can be extended as follows:

y_ijk = β_0 + β_1 x_ijk + β_2 x̄_j + β_3 x̄_k + β_4 x̄_jk + β_5 z_j + β_6 w_k + u_A,j + u_B,k + u_AB,jk + e_ijk .    (2)

In this model, β_1 is the effect of x at Level 1, β_2 is the effect of x at Level A, β_3 is the effect of x at Level B, and β_4 is the effect of x at Level AB (i.e., for the school-neighborhood pair). In addition, β_5 and β_6 are the effects of z and w at Levels A and B, respectively.
Notice that the extended model now partitions both y and x into within- and between-cluster components, although it does so in different ways. Specifically, the components in y are represented by random effects and residuals, whereas the components in x are represented by the values of x at Level 1 (x_ijk) and the cluster means of x at Levels A (x̄_j), B (x̄_k), and AB (x̄_jk). The effects of the cluster means (x̄_j, x̄_k, and x̄_jk) of x in Equation 2 represent contextual effects, that is, the extent to which cluster-level differences in x are associated with cluster-level differences in y above and beyond the effect of x at Level 1 (Raudenbush & Bryk, 2002). In sum, this model partitions the lower-level variables into within- and between-cluster components and allows these components to be associated with one another at different levels.
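The cluster means x̄_j, x̄_k, and x̄_jk that enter Equation 2 are simple within-group averages of x. The following sketch computes them for a toy data set (all memberships and values are made up for illustration); in practice, these columns would be merged back into the student-level data before fitting the model.

```python
import numpy as np

# Toy cross-classified data: each row is a student with a school j,
# a neighborhood k, and a Level-1 variable x (values are made up).
school = np.array([0, 0, 0, 1, 1, 1])
neigh  = np.array([0, 0, 1, 1, 2, 2])
x      = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 3.0])

def group_means(values, groups):
    """Return, per observation, the mean of `values` within its group."""
    out = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = values[mask].mean()
    return out

x_bar_A  = group_means(x, school)              # x̄_j: school means
x_bar_B  = group_means(x, neigh)               # x̄_k: neighborhood means
pair     = school * (neigh.max() + 1) + neigh  # id for each school-neighborhood pair
x_bar_AB = group_means(x, pair)                # x̄_jk: pair means
```

Each of the three mean columns has one value per student, repeated within the corresponding cluster, which is exactly the form required for the predictors in Equation 2.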
The models above incorporate the nonhierarchical structure of the data in multiple ways. First, the models include separate random effects and variance components for the two crossed factors A and B. By contrast, if this structure were simplified or ignored, for example, by treating the factors as hierarchical, then the estimated parameters and standard errors could be biased (Lai, 2019; Luo & Kwok, 2009; Meyers & Beretvas, 2006). Second, the models include a random effect of the "interaction" of the two factors, that is, for the combined membership of individuals in a pair of units in A and B. Omitting this component can sometimes simplify the specification of the model but can also induce bias (Shi et al., 2010). As a general rule, including the random effect of the "interaction" requires that there are multiple observations (n_jk > 1) for at least some of the pairs of units in A and B; otherwise (if all n_jk = 1), it cannot be distinguished from the residual at Level 1 and must be dropped from the analysis (see also Beretvas, 2011). Third, the models can include effects of explanatory variables at each level as well as effects of the cluster means of explanatory variables at Level 1, which allows the cluster-level effects to differ from the effects at Level 1 (Raudenbush & Bryk, 2002; see also Kreft et al., 1995).

Further extensions. The models can also be extended further to address additional research questions. For example, random slopes can be included to allow the effects of the explanatory variables to vary across the units of A or B, and explanatory variables at Levels A or B can be used to explain some of the variance in the slope coefficients (e.g., Raudenbush & Bryk, 2002). In such a model, the effects of lower-level explanatory variables vary both at random and due to the moderating influence of higher-level variables (cross-level interactions [CLIs]). In the following sections, we focus on CCRMs that include only random intercepts and linear effects. We do not consider applications with random slopes, CLIs, or other nonlinear effects in detail, but we return to these extensions later.

MI of Cross-Classified Data
In the following section, we outline two of the main strategies that are typically used to conduct MI, JM and FCS, and we explain how these strategies can be extended for CC-MI. In addition, we consider a number of ad hoc approaches that are based on imputation approaches for single-level and two-level (hierarchical) data. For simplicity, we focus on applications with continuous data; however, either approach can also be used with categorical data, and we return to this topic in the Discussion section.

Joint Modeling
The general idea underlying the JM approach is that a single (joint) imputation model is specified for all variables with missing data, thus generating imputations for all variables simultaneously. The JM approach was developed primarily in the context of single- and two-level data (Schafer & Olsen, 1998; Schafer & Yucel, 2002; for an overview, see Carpenter & Kenward, 2013), but it has also been applied to cross-classified and multiple-membership data (Yucel et al., 2008). To conduct CC-MI with the JM approach, a multivariate CCRM that includes all variables both with and without missing data must be specified.
Suppose that the data comprise observations clustered in two factors A and B as before and with variables measured at different levels. Further, let y^(1) denote the variables at Level 1, y^(A) the variables at Level A, y^(B) the variables at Level B, and y^(AB) the variables at Level AB. For example, if the variables of interest were those in Equation 2, then y^(1) would include y and x, y^(A) would include z, y^(B) would include w, and y^(AB) would be empty. Then, for observation i in unit j of factor A and unit k in factor B, the joint model can be written as

y^(1)_ijk = μ^(1) + u^(1)_A,j + u^(1)_B,k + u^(1)_AB,jk + e_ijk
y^(A)_j = μ^(A) + u^(A)_A,j
y^(B)_k = μ^(B) + u^(B)_B,k
y^(AB)_jk = μ^(AB) + u^(AB)_AB,jk ,    (3)

where the μ are vectors of grand means, u^(1)_A,j and u^(A)_A,j are the random effects and residuals at Level A, u^(1)_B,k and u^(B)_B,k are the random effects and residuals at Level B, u^(1)_AB,jk and u^(AB)_AB,jk are the random effects and residuals at Level AB, and e_ijk are residuals at Level 1. The random effects and residuals at each level are assumed to follow independent multivariate normal distributions with mean vectors of zero and covariance matrices T_A, T_B, T_AB, and Σ, respectively. Similar to the univariate intercept-only model in Equation 1, the joint model partitions the within- and between-cluster components in the variables at each level. In addition, the model incorporates the associations that can exist between the components at each level by allowing the random effects and residuals to be correlated (i.e., through T_A, T_B, T_AB, and Σ). Consequently, the JM approach to CC-MI incorporates information from variables at different levels in the imputation of missing data. For example, when imputing missing data at Level 1, the JM approach incorporates information from variables at Levels A, B, and AB (T_A, T_B, and T_AB) as well as other variables at Level 1 (Σ).
Markov Chain Monte Carlo (MCMC) algorithm. The JM approach to CC-MI can be implemented with MCMC techniques (Browne, 2009; Browne et al., 2001). In the following, we outline the main steps of a generic MCMC algorithm for the estimation of the model parameters and the imputation of missing data at each level (for additional details, see Rasbash & Browne, 2008). For convenience, we write the data of all units and variables as y = (y^(1), y^(A), y^(B), y^(AB)) and the random effects as u = (u_A, u_B, u_AB). At each iteration t, the algorithm samples the model parameters, the random effects at each level, and the missing values in turn from their full conditional distributions. Notice that the sampling steps for the random effects and the residuals are performed by conditioning on the random effects and the observed values of other variables at the same level. In doing so, the JM approach incorporates information from other variables into the imputation of missing data, while accounting for the relationships that exist between variables at different levels.
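The flavor of a single conditional ("Gibbs") sampling step can be illustrated for one school effect u_A,j in a normal model with known variances. The sketch below uses made-up values; the full MCMC algorithm cycles over all effects, levels, parameters, and missing values, but each individual draw has this shrinkage form.

```python
import numpy as np

# Sketch of one conditional sampling step of the kind used in the MCMC
# algorithm: drawing a single school effect u_A,j given everything else,
# assuming a normal model with known variances (illustrative values only).
rng = np.random.default_rng(2)

tau2_A, sigma2 = 0.10, 0.75  # variance of u_A,j and Level-1 residual variance

def draw_u_A_j(partial_resid, tau2, sig2, rng):
    """Draw u_A,j from its full conditional given the partial residuals
    r_ijk = y_ijk - beta_0 - u_B,k - u_AB,jk of school j's students."""
    n_j = partial_resid.size
    prec = n_j / sig2 + 1.0 / tau2              # posterior precision
    mean = (partial_resid.sum() / sig2) / prec  # shrunken school mean
    return rng.normal(mean, np.sqrt(1.0 / prec))

r_j = np.array([0.4, -0.1, 0.3, 0.2])  # partial residuals (toy values)
u_j_draw = draw_u_A_j(r_j, tau2_A, sigma2, rng)
```

The posterior mean shrinks the raw school mean of the partial residuals toward zero, with more shrinkage for small schools or small τ²_A; analogous draws are made for u_B,k and u_AB,jk.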
To our knowledge, there is currently no software that implements the JM approach to CC-MI (but see Yucel et al., 2008). However, the required sampling steps can be carried out in general-purpose software for Bayesian data analysis, such as WinBUGS/OpenBUGS (Lunn et al., 2000), JAGS (Plummer, 2017), or Stan (Stan Development Team, 2021). In addition, a simplified version of this model with no random effects or variables at Level AB can be fit with the Bayesian estimation procedure in Mplus (Muthén & Muthén, 2017).

Fully Conditional Specification
As an alternative to the JM approach, the joint distribution of the variables with missing data can be approximated by imputing one variable at a time while iterating over a sequence of univariate imputation models, one for each variable with missing data. This strategy is known as the FCS approach to MI (Raghunathan et al., 2001;van Buuren et al., 2006). In the context of CC-MI, each imputation model is set up in such a way that it (a) partitions the within-and between-cluster components of the respective target variable and (b) includes other variables and their cluster means as predictors to represent the relationships between the variables at each level.

Because the FCS approach iterates along a sequence of imputation models, one for each target variable with missing data, different types of models are required to address variables at different levels. Suppose that y^(p) denotes the pth target variable with missing data (p = 1, ..., P). If y^(p) is measured at Level 1, then the imputation model takes the form of a CCRM as follows:

y^(p)_ijk = x^(p)'_ijk β^(p) + u^(p)_A,j + u^(p)_B,k + u^(p)_AB,jk + e^(p)_ijk ,    (4)

where x^(p)_ijk denotes the predictors, β^(p) denotes the fixed effects, and u^(p)_A,j, u^(p)_B,k, u^(p)_AB,jk, and e^(p)_ijk denote the random effects and residuals with variance components τ²^(p)_A, τ²^(p)_B, τ²^(p)_AB, and σ²^(p). The predictors in x^(p) typically include all variables other than y^(p). In addition, in CC-MI, x^(p) includes the cluster means of the predictors at Levels A, B, and AB. In doing so, the imputation model takes into account the within- and between-cluster components both in y^(p) (through random effects) and in x^(p) (through cluster means) as well as the relations that can exist between the two (through β^(p)). For example, if the first target variable y^(1) was the outcome variable y from Equation 2, then the predictors would be x, z, w, and the cluster means of x at Levels A, B, and AB (x̄_j, x̄_k, and x̄_jk). For target variables at Levels A, B, and AB, the imputation models take on simpler forms but follow the same strategy by conditioning on the other variables at the same levels as well as the cluster means of lower-level variables. Specifically, for variables at Levels A, B, and AB, the models become

y^(p)_j = x^(p)'_j β^(p) + u^(p)_A,j (Level A),
y^(p)_k = x^(p)'_k β^(p) + u^(p)_B,k (Level B),
y^(p)_jk = x^(p)'_jk β^(p) + u^(p)_AB,jk (Level AB),    (5)

where x^(p) denotes the predictors, β^(p) denotes the fixed effects, and u^(p)_A,j, u^(p)_B,k, and u^(p)_AB,jk denote the random effects and residuals with variance components as before. The predictor variables in x^(p) can include all variables other than y^(p) that are measured at the same level as well as the cluster means of predictors that were measured at lower levels (e.g., at Levels 1 or AB for a target variable at Level A). In addition, in unbalanced and partially cross-classified data, x^(p) can include cluster means of variables that were measured at other higher levels (e.g., at Level B for a target variable at Level A).
For example, if the second target variable y^(2) was the explanatory variable z from Equation 2, which is measured at Level A, then the predictors would be the cluster means of x and y at Level A (x̄_j and ȳ_j) and potentially those of w (in unbalanced and partially cross-classified data).
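Once the parameters and random effects of Equation 4 are in hand, the actual imputation of a missing Level-1 value is a draw from a normal distribution around its model-implied prediction. The sketch below illustrates only this final draw with made-up parameter values (it is not an estimate from real data and omits the preceding estimation steps).

```python
import numpy as np

# Sketch of the final draw for a Level-1 target variable in FCS:
# y*_ijk = x'beta + u_A,j + u_B,k + u_AB,jk + e, with e ~ N(0, sigma2).
# All parameter values below are illustrative.
rng = np.random.default_rng(3)

beta   = np.array([0.5, 0.3, 0.2])  # fixed effects: intercept, x, cluster mean of x
sigma2 = 0.6                        # Level-1 residual variance

def impute_level1(x_row, u_A_j, u_B_k, u_AB_jk, rng):
    """Draw one imputation around the model-implied mean."""
    mu = x_row @ beta + u_A_j + u_B_k + u_AB_jk
    return rng.normal(mu, np.sqrt(sigma2))

x_row  = np.array([1.0, 0.8, 0.5])  # predictor row (intercept, x, x̄ for this case)
y_star = impute_level1(x_row, 0.1, -0.2, 0.05, rng)
```

Drawing from the residual distribution, rather than filling in the predicted mean itself, is what preserves the variability of the imputed variable across imputations.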
Similar to the JM approach, the FCS approach aims to accommodate the cross-classified structure of the data in each imputation model by (a) partitioning the variables into within- and between-cluster components and (b) allowing for associations among the components at different levels. However, in the FCS approach, only the components in the target variables are represented by random effects, whereas those in the predictors are represented by cluster means (see also Enders et al., 2016; Grund et al., 2018b; Lüdtke et al., 2017; Mistler & Enders, 2017). If the cluster means themselves are based on variables with missing data, then they are updated in each step of the procedure, so that they reflect the most recent imputations of the underlying variables (see also Royston, 2005; van Buuren & Groothuis-Oudshoorn, 2011). To our knowledge, the FCS approach to CC-MI is currently supported only by the packages mice (van Buuren & Groothuis-Oudshoorn, 2011) and miceadds in the statistical software R (R Core Team, 2021).
FCS algorithm. The FCS approach to CC-MI that is implemented in miceadds uses an (approximate) Gibbs sampling algorithm that generates imputations by iterating between sampling the regression coefficients, the random effects at each level, and the imputed values. For simplicity, we describe the algorithm only for a single variable at Level 1, dropping the p superscript, and we write the random effects and their variances as u = (u_A,j, u_B,k, u_AB,jk) and τ² = (τ²_A, τ²_B, τ²_AB). Notice that the sampling steps for the random effects and the imputations at Level 1 are implemented by conditioning on the predictor variables and their cluster means (in x_ijk). In doing so, the FCS approach accommodates the relationships that can exist between the variables at different levels, similar to the JM approach. In addition, the random effects are sampled conditionally on one another and in an iterative manner. This is required in (partially) cross-classified data because the observed data provide information about multiple random effects. The sampling steps for the regression coefficients are standard
Gibbs steps with an implicit uniform prior for β. However, the algorithm is not a true Gibbs sampler because it omits the sampling of the variance components, relying on estimated values instead.
Single- and two-level FCS with cluster means. As an alternative to CC-MI, cross-classified data can also be accommodated with simpler imputation models (e.g., for single- or two-level data) by including the effects of cluster membership through fixed effects or additional cluster means (Andridge, 2011; Drechsler, 2015; Lüdtke et al., 2017; Wijesuriya et al., 2020). These ad hoc approaches naturally do not offer a full replacement of CC-MI, but they can be useful if software for CC-MI is not available. Here, we consider one such approach that relies on additional cluster means and can be based on single- or two-level FCS.
The FCS approach to CC-MI (Equation 4) accommodates the cross-classified structure of the data by including random effects for the target variable and cluster means for the predictor variables in each imputation model. Cluster means can be included as predictors even in simpler models, so the main difference is how these approaches address the random effects for the target variable. For example, when using single-level FCS, the random effects for a target variable at Level 1 can be approximated as follows:

y^(p)_ijk = x^(p)'_ijk β^(p) + β^(p)_A ȳ^(p)_j(−i) + β^(p)_B ȳ^(p)_k(−i) + β^(p)_AB ȳ^(p)_jk(−i) + e^(p)_ijk ,    (6)

where x^(p) contains the predictors and their cluster means as before, and ȳ^(p)_j(−i), ȳ^(p)_k(−i), and ȳ^(p)_jk(−i) are the cluster means at Levels A, B, and AB for the target variable, which are computed from the imputed values from the previous iteration of the imputation procedure. However, in order to avoid a direct dependency between the target variable and its own imputed values (van Buuren, 2018, Ch. 6), these cluster means are computed individually for every case i, such that case i is excluded from the computation, that is (e.g., at Level A):

ȳ^(p)_j(−i) = (1 / (n_j − 1)) Σ_{i' ≠ i} y^(p)_i'j ,    (7)

where n_j is the number of cases in cluster j and the sum runs over all other cases in the same cluster. Going forward, we will refer to these cluster means as adjusted cluster means. Conceptually, the adjusted cluster means can be regarded as a proxy for the information about case i that the other cases within the same cluster provide (i.e., the ICC), thus mimicking the contribution of the random effects in CC-MI. This method requires the adjusted cluster means to be updated at each iteration of the imputation procedure with "passive" imputation steps, an option that is provided by many statistical software packages (Raghunathan et al., 2018; Royston, 2005; van Buuren & Groothuis-Oudshoorn, 2011).
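The adjusted (leave-one-out) cluster means can be computed efficiently by subtracting each case's own value from its cluster total. The following sketch uses toy values and a single clustering factor; in the approach above, the same computation would be repeated for the A, B, and AB memberships and refreshed after every imputation step.

```python
import numpy as np

# Sketch of the "adjusted" (leave-one-out) cluster means used by the
# single-level FCS workaround: case i is excluded from its own cluster
# mean so that the imputation does not feed on itself (toy values).
y_imp   = np.array([1.0, 2.0, 3.0, 4.0, 6.0])  # current observed/imputed y
cluster = np.array([0,   0,   0,   1,   1  ])  # e.g., school membership

def loo_cluster_means(values, groups):
    """Per case i: mean of the other cases in the same cluster
    (requires at least two cases per cluster)."""
    out = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        s, n = values[mask].sum(), mask.sum()
        out[mask] = (s - values[mask]) / (n - 1)  # exclude each case in turn
    return out

y_bar_adj = loo_cluster_means(y_imp, cluster)
```

Note that each case receives a slightly different "cluster mean," which is precisely what breaks the direct dependency between a case and its own imputed value.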

Differences Between JM and FCS
The main difference between the JM and FCS approaches to CC-MI is how they represent the between-cluster components in the variables included in the imputation model. In the JM approach, the imputation model is based on a multivariate CCRM and represents the between-cluster components with random effects, which correspond to the latent within- and between-group components in the variables at each level (Asparouhov & Muthén, 2006; Lüdtke et al., 2008). By contrast, the FCS approach is based on a sequence of univariate CCRMs, in which the between-cluster components of the target variable are also represented by random effects, whereas those of the predictor variables are represented by manifest cluster means (e.g., Raudenbush & Bryk, 2002).
In the context of hierarchical data, it has been shown that the JM and FCS approaches to MI are equivalent in cases with balanced data, that is, when all clusters have the same size (Carpenter & Kenward, 2013; see also Enders et al., 2016; Lüdtke et al., 2017; Resche-Rigon & White, 2018). Specifically, for balanced data, it can be shown that the two approaches represent the conditional distribution of the missing data in different but equivalent ways, provided that the cluster means are included in the FCS approach. Grund et al. (2018a) further showed that the two approaches provide nearly identical results even in unbalanced data, where the exact equivalence between them no longer holds (see also Resche-Rigon & White, 2018).
In the Appendix, and in more detail in Supplement A in the Online Supplemental Materials, we extend these considerations to cross-classified data and find that the same result holds but under stronger conditions. Specifically, we found that the FCS and JM approaches to CC-MI are asymptotically equivalent in balanced, fully cross-classified data, that is, when the number of units in A and B and the cluster sizes are constant and the sample is sufficiently large. The equivalence holds only asymptotically because the FCS approach induces a slight dependency between marginally independent observations, whose strength diminishes as the numbers of units in A and B become large. For this reason, the FCS and JM approaches are not formally equivalent in cross-classified data. Nonetheless, given that the discrepancy between the FCS and JM approaches appears to be relatively minor, we would still expect their performance to be similar in practice (see also Grund et al., 2018a). In addition, from a practical perspective, the FCS approach often has advantages over the JM approach because it allows for a more flexible specification of the imputation models, with finer control over what type of imputation model is used for each variable and which predictor variables are included in them (see also van Buuren et al., 2006).

Previous research on CC-MI has focused primarily on the FCS approach (Wijesuriya, 2021; however, see Yucel et al., 2008) or specific applications, such as missing item responses in educational assessments (Kadengye et al., 2014) or missing data in social network analysis (Jorgensen et al., 2018). In addition, Hill and Goldstein (1998) considered the special case of missing unit identifiers in longitudinally cross-classified data. Overall, these studies suggest that methods for CC-MI can provide an effective treatment of missing values in cross-classified data. However, to our knowledge, no study has systematically compared the JM and FCS approaches to CC-MI in more general settings.

Simulation
In the following, we present the results of a simulation study in which we evaluated the performance of different MI approaches for cross-classified data. This included the JM and FCS approaches to CC-MI as well as ad hoc approaches that were based on single-and two-level MI. The computer code needed to run the simulation study is provided in the OSF repository (https://osf.io/5em2d).

Data Generation
In the simulation study, we generated data for two standardized variables x and y from a multivariate CCRM (see Equation 3) with observations clustered within two crossed factors A (e.g., schools) and B (e.g., neighborhoods). Specifically, for an observation i (i = 1, ..., N) in unit j of factor A (j = 1, ..., J) and unit k of factor B (k = 1, ..., K), the data were generated with the following model:

(x_ijk, y_ijk)' = μ + u_A,j + u_B,k + u_AB,jk + e_ijk ,    (8)

where the random effects (u_A,j, u_B,k, and u_AB,jk) and the residuals (e_ijk) followed independent bivariate normal distributions with mean vectors of zero and covariance matrices T_A, T_B, T_AB, and Σ, respectively. The covariance matrices were chosen in such a way that (a) the two variables x and y would have certain proportions of variance at each level (τ²_A, τ²_B, τ²_AB, and σ²) and (b) the random effects and residuals of x and y would be correlated to a given extent.
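A generating model of this kind can be sketched as follows. The covariance values below are illustrative stand-ins, not the article's simulation conditions, and the sketch uses a fully crossed layout for simplicity.

```python
import numpy as np

# Sketch of the bivariate data-generating model: (x, y) with correlated
# random effects at Levels A, B, and AB (illustrative covariances only).
rng = np.random.default_rng(4)

def cov2(var, r):
    """2x2 covariance matrix for (x, y) with equal variances and correlation r."""
    return np.array([[var, r * var], [r * var, var]])

T_A, T_B, T_AB = cov2(0.10, 0.5), cov2(0.10, 0.5), cov2(0.05, 0.5)
Sigma = cov2(0.75, 0.3)

J, K, n = 20, 20, 5  # units in A, units in B, observations per pair
u_A  = rng.multivariate_normal([0, 0], T_A,  size=J)           # (J, 2)
u_B  = rng.multivariate_normal([0, 0], T_B,  size=K)           # (K, 2)
u_AB = rng.multivariate_normal([0, 0], T_AB, size=(J, K))      # (J, K, 2)
e    = rng.multivariate_normal([0, 0], Sigma, size=(J, K, n))  # (J, K, n, 2)

# (x_ijk, y_ijk)' = mu + u_A,j + u_B,k + u_AB,jk + e_ijk, with mu = 0
xy = (u_A[:, None, None, :] + u_B[None, :, None, :]
      + u_AB[:, :, None, :] + e)
x, y = xy[..., 0], xy[..., 1]
```

Because every level contributes a correlated component, the marginal correlation between x and y is the sum of the covariances across levels divided by the total variance.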
Partial cross-classification. The model in Equation 8 covers both fully and partially cross-classified data. In this study, we focused on partially cross-classified data and used the following procedure to assign a subset of the units in B to a subset of the units in A (see Figure 2). The procedure consisted of two steps. First, we defined the number of units in B that needed to be assigned to each unit in A (n_B/A) and initially assigned these units in a block-like pattern, where the first n_B/A units in B would be assigned to the first n_A/B = n_B/A · J/K units in A, and so on. Second, we introduced a certain number of random permutations into these initial assignments. This resulted in partially cross-classified data with a total number of n_B/A · J pairs between A and B and with an assignment pattern that was in part deterministic and in part random. For each pair of A and B, we then generated either a balanced or unbalanced number of observations at Level 1. In the balanced case, we generated clusters of constant size (all n_jk = n); in the unbalanced case, we chose one unit in B per unit in A and increased the cluster size for this pair while decreasing the cluster size for all other units in B that were assigned to the same unit in A. We did this in such a way that the average cluster size would remain unchanged (i.e., the total number of observations Σ_j Σ_k n_jk remained constant) in the unbalanced data. Table 1 shows an example with J = 16 units in A and K = 32 units in B, where we assigned n_B/A = 8 units in B to each unit in A with 20% random assignments and an unbalanced number of observations with an average cluster size of n = 5.
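The two-step assignment procedure can be sketched as follows (an illustrative Python version; names are ours, and for simplicity the sketch does not guard against the rare duplicate pairs a swap can create):

```python
import random

def assign_partial_crossing(J, K, n_B_per_A, prop_random=0.20, seed=1):
    """Assign n_B_per_A units of B to each unit of A: first in a block-like
    pattern, then randomly permute roughly prop_random of the assignments.
    Returns a list of (j, k) pairs."""
    n_A_per_B = n_B_per_A * J // K  # A-units sharing each block of B-units
    pairs = []
    # Step 1: deterministic, block-like assignment
    for j in range(J):
        block = j // n_A_per_B
        for k in range(block * n_B_per_A, (block + 1) * n_B_per_A):
            pairs.append([j, k])
    # Step 2: random permutations (each swap changes two assignments)
    rng = random.Random(seed)
    n_swaps = int(prop_random * len(pairs) / 2)
    for _ in range(n_swaps):
        p, q = rng.sample(range(len(pairs)), 2)
        pairs[p][1], pairs[q][1] = pairs[q][1], pairs[p][1]
    return [tuple(pr) for pr in pairs]
```

Because the swaps only exchange B-units between existing pairs, each unit in A keeps exactly n_B/A partners and each unit in B keeps exactly n_A/B partners.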
Missing data. Once the data were generated, we induced missing values in x on the basis of the values in y in accordance with a missing completely at random (MCAR) or missing at random (MAR) mechanism that we simulated with the following linear model:

r_ijk = a + λ y_ijk + ε_ijk ,

where ε_ijk denotes a standard normal residual. In this model, a is a quantile of the standard normal distribution that determines the probability of missing data (e.g., a = −0.674 for 25% missing data), and λ determines the missing data mechanism (i.e., MCAR if λ = 0, MAR otherwise). A value x_ijk was set to missing when the corresponding r_ijk > 0.
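This latent-propensity mechanism can be sketched in Python (an illustration under our own assumptions; the article does not show the exact residual scaling, but a standard normal residual reproduces the stated 25% missing-data rate under MCAR):

```python
import random

def impose_mar(x, y, a=-0.674, lam=0.70, seed=1):
    """Set x_i to missing (None) when r_i = a + lam*y_i + e_i > 0, with
    e_i ~ N(0, 1). With lam = 0 (MCAR) and a = -0.674, the missing-data
    probability is P(e > 0.674) = .25."""
    rng = random.Random(seed)
    return [None if a + lam * yi + rng.gauss(0, 1) > 0 else xi
            for xi, yi in zip(x, y)]
```

With λ > 0, cases with larger y are more likely to lose their x value, which makes the mechanism MAR given y.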

Simulated Conditions
Using this data-generating procedure, we varied the sample sizes at Levels A and B, where we fixed the number of units in A to J = 128 and set the number of units in B to K = 64, 128, or 256. For the cross-classification, we set the number of units in B per unit in A to n_B/A = 8, which resulted in a constant number of 1,024 pairs between A and B in all conditions. In addition, we simulated both balanced and unbalanced samples with an average cluster size of n = 5. This resulted in cross-classified data with a total sample size of N = 5,120 and moderate to strong degrees of cross-classification as indicated by Cramér's V (i.e., in comparison with a hierarchical data structure; see Lai, 2019).1 For the variance components, we fixed the residual variances at Level 1 (σ²) to .50 and set the variances of the random effects at Levels A, B, and AB (τ²_A, τ²_B, and τ²_AB) to values between .10 and .30. Specifically, we considered three configurations: one equal-variance condition, where all variances were set to .20; and two unequal-variance conditions, where the variances at Levels A, B, and AB were set to .30, .20, and .10, or .20, .30, and .10, respectively. The correlations between the random effects at Levels A, B, and AB were fixed to .50, and the correlations between the residuals at Level 1 were fixed to .20. Finally, we simulated both MCAR (λ = 0) and MAR data (λ = .70) and fixed the probability of missing data to 25%. This resulted in 36 simulated conditions, each of which was replicated 2,000 times.2

Imputation and Analysis
To handle the missing data, we considered 10 different imputation approaches. This included the JM and FCS approaches to CC-MI as well as eight ad hoc approaches based on single- and two-level FCS. The ad hoc approaches included both "naive" specifications of these methods, which reflect common recommendations for applications in single- and two-level data, and extended specifications that aim to accommodate the cross-classified data structure by including adjusted cluster means. Specifically, the methods were as follows:

1. FCS-1L: single-level FCS.
2. FCS-1L-M: single-level FCS, extended to include cluster means for the predictor and adjusted cluster means for the target variable (at Levels A, B, and AB).
3. FCS-2L-A: two-level FCS with random effects of A and cluster means for the predictor.
4. FCS-2L-A-M: two-level FCS with random effects of A and cluster means for the predictor, extended to include adjusted cluster means for the target variable (at Levels B and AB).
5. FCS-2L-B: two-level FCS with random effects of B and cluster means for the predictor.
6. FCS-2L-B-M: two-level FCS with random effects of B and cluster means for the predictor, extended to include adjusted cluster means for the target variable (at Levels A and AB).
7. FCS-2L-AB: two-level FCS with random effects of AB and cluster means for the predictor.
8. FCS-2L-AB-M: two-level FCS with random effects of AB and cluster means for the predictor, extended to include adjusted cluster means for the target variable (at Levels A and B).
9. FCS-CC: FCS approach to CC-MI.
10. JM-CC: JM approach to CC-MI.
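The "-M" specifications augment the imputation models with cluster means at the levels whose random effects are not modeled directly. As a rough sketch of the underlying idea (plain observed-data means only; the adjusted cluster means used for the incomplete target variable are constructed differently, as described in the article), per-observation means at Levels A, B, and AB can be computed as:

```python
from collections import defaultdict

def cluster_means(rows):
    """rows: list of (j, k, value). Returns, for each observation, the mean
    of `value` within its A-unit, B-unit, and A-by-B cell, for use as
    additional predictors in an imputation model."""
    sums = {lvl: defaultdict(float) for lvl in ("A", "B", "AB")}
    counts = {lvl: defaultdict(int) for lvl in ("A", "B", "AB")}
    for j, k, v in rows:
        for lvl, key in (("A", j), ("B", k), ("AB", (j, k))):
            sums[lvl][key] += v
            counts[lvl][key] += 1
    return [(sums["A"][j] / counts["A"][j],
             sums["B"][k] / counts["B"][k],
             sums["AB"][(j, k)] / counts["AB"][(j, k)])
            for j, k, _ in rows]
```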
To implement JM-CC, we used OpenBUGS (Lunn et al., 2000); to implement FCS-CC and the methods based on single- and two-level FCS, we used the R packages mice and miceadds (van Buuren & Groothuis-Oudshoorn, 2011). Finally, we also conducted the analyses with the complete data (CD) and after listwise deletion (LD) to provide a means of comparison.
In the analysis of the imputed data, we were interested in two different models. The first model was an intercept-only CCRM for x (see Equation 1). The second model was a random-intercept CCRM with y as the outcome variable and x as the explanatory variable (see Equation 2). The parameters of interest were the estimated variance components of x at each level (τ^(x)2_A, τ^(x)2_B, τ^(x)2_AB, and σ²) and the regression coefficients in the second model. For each parameter, we computed the relative bias and the coverage rates of the 95% confidence interval (CI) to evaluate the accuracy of the parameter estimates and the estimated standard errors. Due to the different representation of the between-cluster components in the data-generating model and the analysis, the true values of the regression coefficients cannot generally be expressed in closed form. For this reason, we used the average estimates in the CD as reference values in the computation of the bias and coverage.
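The two performance criteria can be computed as follows (a minimal Python illustration; function names are ours):

```python
def relative_bias(estimates, true_value):
    """Average deviation of the estimates from the true value,
    expressed as a proportion of the true value."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_value) / true_value

def coverage(estimates, ses, true_value, z=1.96):
    """Proportion of nominal 95% CIs (est +/- z*SE) across replications
    that contain the true value."""
    hits = sum(1 for est, se in zip(estimates, ses)
               if est - z * se <= true_value <= est + z * se)
    return hits / len(estimates)
```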

Results
The main results are summarized in Tables 2 through 4. For simplicity, we present the detailed results only for selected conditions with J ¼ K ¼ 128 units in A and B, unbalanced clusters, and MAR data. The results for the other conditions were often similar, so we discuss them only when needed and provide them in full in Supplement C in the Online Supplemental Materials and the OSF repository (https://osf.io/5em2d).
The results for the estimated variance components in x are presented in Table 2. The FCS and JM approaches to CC-MI (FCS-CC, JM-CC) provided approximately unbiased parameter estimates in all simulated conditions. By contrast, the single- and two-level FCS approaches (FCS-1L, FCS-2L) led to bias unless their specification was extended to include the adjusted cluster means of the incomplete variable x. The direction and size of the bias depended on the variance configuration and how the random effects at each level were accommodated by these procedures. Specifically, single-level FCS (FCS-1L) underestimated the variances of the random effects and overestimated the residual variance at Level 1. Two-level FCS (FCS-2L-A, FCS-2L-B, and FCS-2L-AB) slightly overestimated the variance of the random effect that was included in the imputation model (e.g., τ^(x)2_A for FCS-2L-A), where the bias was largest for FCS-2L-AB. In addition, these methods underestimated the variances of the other random effects and overestimated the residual variance at Level 1, except for FCS-2L-AB, which estimated the residual variance with approximately no bias. The bias tended to be largest when the omitted variance components were large. However, when we extended the single- and two-level FCS approaches to include the adjusted cluster means, there was little to no bias in the estimated variance components. Finally, LD led to essentially unbiased estimates of the variance components. These results were fairly consistent across the simulated conditions.
The bias in the estimated regression coefficients in the CCRM of y regressed on x is shown in Table 3. Similar to before, FCS-CC and JM-CC provided estimates of the regression coefficients with little to no bias, whereas the estimates provided by the single- and two-level FCS approaches (FCS-1L and FCS-2L) were strongly biased unless their specifications also included the adjusted cluster means of the incomplete variable. When we extended single- and two-level FCS in this manner, the bias in the regression coefficients became smaller for two-level FCS with random effects of A or B (FCS-2L-A-M and FCS-2L-B-M) and even more so for single-level FCS (FCS-1L-M) and two-level FCS with random effects of AB (FCS-2L-AB-M). The size of the remaining bias depended on the variance configuration, such that the bias was largest when the corresponding variance component was large, especially for FCS-2L-A-M and FCS-2L-B-M. LD led to a consistent bias in all regression coefficients. Similar to above, the results were fairly consistent across conditions, except for LD, which provided unbiased results under MCAR. Finally, the coverage rates for the 95% CIs of the estimated regression coefficients (Table 4) followed the same pattern as the bias. Specifically, for FCS-CC and JM-CC, the coverage rates were close to the nominal value of 95%. For the single- and two-level FCS approaches without adjusted cluster means (FCS-1L, FCS-2L-A, FCS-2L-B, and FCS-2L-AB), the coverage rates were well below the nominal value. By contrast, when we included the adjusted cluster means, we found coverage rates close to the nominal value for both single- and two-level FCS (FCS-1L-M, FCS-2L-A-M, FCS-2L-B-M, and FCS-2L-AB-M). For LD, we found coverage rates well below the nominal value of 95%. These results were again fairly consistent across conditions, except for LD, which showed nominal coverage under MCAR.
To summarize, the results of the simulation study suggested three key findings. First, both the FCS and JM approaches to CC-MI provided accurate results in all simulated conditions. Second, the conventional single- or two-level FCS approaches performed poorly because they failed to accommodate the cross-classified data structure. Third, when the single- and two-level FCS approaches were extended to include the adjusted cluster means for the incomplete target variables, they provided much more accurate results that were very similar to CC-MI. These results are encouraging because they suggest that CC-MI as well as suitable extensions of single- and two-level MI can provide an effective treatment of missing values in cross-classified data.

Example Analysis
To illustrate the application of the different approaches to CC-MI, we use data from the Early Childhood Longitudinal Study (ECLS-K 1998). The ECLS-K is a longitudinal study that focuses on children's early school experiences, with multiple measurements beginning in kindergarten (1998), through primary school (1999-2004), and up to secondary school (8th grade, 2007). An interesting feature of the ECLS-K data is that most children in the sample change schools when they transition from primary to secondary school, which means that the students are cross-classified by primary school (factor P) and secondary school (factor S).
In this example, we use a subset of the ECLS-K data with observations from Grades 5 and 8, comprising a sample of 9,067 students from 1,997 primary and 2,502 secondary schools who changed schools during that time and for whom school membership at both time points was known (Cramér's V = .858). Specifically, we are interested in the relationship between reading achievement and the amount of time children spent doing homework after their transition to secondary school, controlling for differences between types of schools (private vs. public):

reading_ips = β_0 + β_1 homework_ips + β_2 (homework mean)_P,p + β_3 (homework mean)_S,s + β_4 (homework mean)_PS,ps + β_5 private_s + u_P,p + u_S,s + u_PS,ps + e_ips .

In addition, we fit an intercept-only model for the amount of time spent on homework to quantify the amount of variance between primary schools, secondary schools, and primary-secondary school pairs. In this sample, 553 (6.1%) of the cases had missing data on at least one of the three variables.
To handle the missing data, we used a subset of the methods presented above: LD, single-level FCS with (adjusted) cluster means (FCS-1L-M), and the FCS approach to CC-MI (FCS-CC). These methods were chosen because they either performed well in our simulation study (FCS-1L-M and FCS-CC) or served as a means of comparison (LD). To implement the two MI approaches, we used the R packages mice and miceadds, and we generated 20 imputed data sets. We also used the package EdSurvey (Bailey et al., 2021) to process the data, lme4 (Bates, 2010) to fit the analysis models, and mitml to pool the results using Rubin's (1987) rules. The computer code for this example is provided in Supplement B in the Online Supplemental Materials and the OSF repository (https://osf.io/5em2d).
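Rubin's (1987) pooling rules, as used by mitml, can be sketched for a single scalar parameter (an illustrative Python version; the actual package also handles multiparameter pooling and model comparisons):

```python
def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors across
    imputed data sets with Rubin's (1987) rules.
    Returns (pooled estimate, pooled SE, degrees of freedom)."""
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled estimate
    w = sum(variances) / m                                  # within-imputation
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation
    t = w + (1 + 1 / m) * b                                 # total variance
    # Rubin's degrees of freedom for the t-based CI
    r = (1 + 1 / m) * b / w
    df = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else float("inf")
    return qbar, t ** 0.5, df
```

The between-imputation variance b reflects the uncertainty due to the missing data; as the estimates agree more closely across imputations, the pooled SE approaches the average within-imputation SE.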
The results are presented in Table 5. Overall, the results showed that students who spent more time on homework had higher reading achievement (at Level 1). In addition, there was a positive contextual effect of time spent on homework at the primary school level (P), indicating that students who had attended primary schools that assigned more homework had higher reading achievement in secondary school. However, there were no contextual effects at the secondary school level (S) or at the level of the primary-secondary school interaction (P-S). There was also a positive effect of the type of school, indicating that students at private (vs. public) schools had higher reading achievement. Finally, the results for the variance components indicated substantial amounts of variance at each level (P, S, and P-S). Due to the relatively small percentage of missing data, the results were very consistent across the methods for handling missing data, and the estimated regression coefficients were usually within half a standard error of each other. The largest difference in the estimated coefficients was the effect of time spent on homework at the secondary school level, which was essentially zero for LD and slightly positive (but non-significant) for FCS-1L-M and FCS-CC.

Limitations and Extensions
An important requirement of MI is that the imputation procedure must accommodate the relevant features of the data and the intended analyses. In our study, we focused on applications in which the intended analyses were CCRMs with random intercepts and explanatory variables with linear effects. For these applications, we outlined how the JM and FCS approaches to MI can be implemented to accommodate cross-classified data, using methods that were either based directly on univariate and multivariate CCRMs (JM-CC and FCS-CC) or that emulated them in an ad hoc manner (e.g., FCS-1L-M). The main strength of these approaches is that they provide a fairly general treatment of missing data in cross-classified data and support a broad range of CCRMs within these limits. This is particularly useful when the imputed data will be used by multiple analysts and in potentially many different analyses. Naturally, CCRMs can also be extended by including random slopes or nonlinear effects (e.g., CLIs). These types of effects complicate the treatment of missing data because they cause conventional MI approaches such as JM and FCS to become incompatible with the intended analysis (Du et al., 2022; see also Seaman et al., 2012). Recent research has shown that substantive-model-compatible (SMC) versions of these approaches can be used to ensure compatibility by including the substantive analysis model directly in the imputation procedure (Bartlett et al., 2015; Goldstein et al., 2014). Several studies have shown that SMC methods can be extremely effective at handling missing data in single- and multilevel analyses with nonlinear effects (Erler et al., 2017; Grund, Lüdtke, et al., 2021; Lüdtke et al., 2020). The main advantage of SMC methods is that they can be fine-tuned to accommodate the more complex features of a particular analysis, at the cost of making the treatment of missing data more specific to this analysis.
In principle, SMC versions of the JM and FCS approaches to CC-MI could also be used in applications of CCRMs with random slopes and nonlinear effects (see also Goldstein et al., 2014). To our knowledge, however, there is currently no software that provides SMC versions of JM and FCS for cross-classified data. As an alternative, general-purpose software for Bayesian data analysis (e.g., WinBUGS/OpenBUGS, JAGS, or Stan) can be used to implement an SMC version of JM or the sequential modeling approach (Ibrahim et al., 2002) to MI. The sequential modeling approach ensures compatibility by factorizing the joint distribution of the variables into a sequence of univariate conditional models, one of which corresponds to the intended analysis (Ibrahim et al., 2002; see also Lüdtke et al., 2020).
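For an analysis model of y given two incomplete covariates, this factorization can be written generically as (a sketch; the symbols are ours):

```latex
f\bigl(y, x^{(1)}, x^{(2)} \mid \theta\bigr)
  = \underbrace{f\bigl(y \mid x^{(1)}, x^{(2)}, \theta_y\bigr)}_{\text{analysis model}}
    \, f\bigl(x^{(2)} \mid x^{(1)}, \theta_2\bigr)
    \, f\bigl(x^{(1)} \mid \theta_1\bigr).
```

Because the analysis model appears as one factor of the joint distribution, imputations drawn from this specification are compatible with the intended analysis by construction.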
In addition to the compatibility issues caused by nonlinear effects, it has been shown that the imputation models used in the FCS approach are sometimes incompatible with each other, even if the intended analysis includes only linear effects (Liu et al., 2014;Zhu & Raghunathan, 2015). This issue has also been raised in the context of multilevel analyses, where the conditional models employed by FCS sometimes do not correspond to a well-defined joint model (Resche-Rigon and White, 2018; see also Du et al., 2022). In the present study, we found that a similar problem applies to the FCS approach in cross-classified data (see the Appendix), although this did not have any noticeable impact on its performance (see also Grund et al., 2018a). Nonetheless, the (lack of) compatibility in the FCS approach remains an important issue, and researchers should be mindful when applying this method in practice (for a more detailed discussion, see Du et al., 2022).

Discussion
In the present article, we compared different approaches for the imputation of missing values in cross-classified data (CC-MI). To this end, we introduced an extension of the popular JM and FCS approaches to MI for incomplete cross-classified data. On the basis of theoretical considerations and the results of a simulation study, we found that, though not formally equivalent, both the JM and FCS approaches to CC-MI provided an effective treatment of incomplete cross-classified data. In addition, we found that simpler approaches based on single- or two-level FCS can accommodate cross-classified data by extending the imputation models to include (adjusted) cluster means. Finally, we illustrated the application of these methods with the mice and miceadds packages for the statistical software R in a worked example with real data from education research.
Our findings have multiple implications for practice. First, the JM and FCS approaches to CC-MI appear to be similarly suited for handling incomplete cross-classified data. The FCS approach can be particularly convenient because it can easily handle different types of variables (e.g., a mixture of continuous and categorical data) and provides finer control over the selection of predictor variables in each model. For the FCS approach to CC-MI to accommodate the cross-classified structure of the data, the imputation models should include (a) the random effects for each crossed factor and (if possible) their interaction and (b) cluster means of the predictor variables to accommodate the relationships between the variables at each level. As an alternative to CC-MI, cross-classified data can also be handled with simpler techniques that are based on single- or two-level FCS. This can be beneficial when methods for CC-MI are unavailable or suffer from convergence problems. In such cases, single- or two-level FCS approaches can be extended to include (adjusted) cluster means, which can reduce or avoid the computational burden of modeling multiple random effects in CCRMs while providing results that are often similar to CC-MI.
The present study has several limitations in addition to those listed above. First, there are several variants of CCRMs for nonhierarchical data that should be considered in future research. This includes multiple-membership multiple-classification (MMMC) models that can be used to analyze data in which observations are not only clustered in multiple crossed factors but can also belong to multiple units of the same factor (Browne et al., 2001; Grady & Beretvas, 2010; Park & Beretvas, 2020). Similarly, in longitudinal research, CCRMs can be extended to distinguish between acute and cumulative effects of cluster membership when this membership changes over time (Cafri et al., 2015). Although MMMC and similar models share many of the features of the CCRMs considered in this article, little is known about how multiple-membership structures can and should be accommodated in the treatment of missing data (however, see Yucel et al., 2008). Second, in our simulation study, we focused on partially cross-classified data with structural features that are typical in education research (e.g., Garner & Raudenbush, 1991; Goldstein & Sammons, 1997; Raudenbush, 1993). Future research should also evaluate CC-MI in settings with more challenging features, for example, with a small number of clusters at Levels A and B, or common features from other areas of research, for example, with fully cross-classified data or only a single observation at Level 1 (i.e., without random effects of the "interaction"). Third, we assumed that the identifiers that denote the cluster membership for each unit were fully observed. However, especially in longitudinal data, in which cluster membership can change over time, these identifiers can also be missing, and more research is needed to determine how to handle missing data in these cases (see also Hill & Goldstein, 1998; van Buuren, 2011).
Fourth, in our simulation study, we did not consider higher-level variables with missing data (e.g., at Levels A, B, or AB). Future research should therefore evaluate the performance of the different approaches to CC-MI for the treatment of missing data in higher-level variables (see also Enders et al., 2018;Grilli et al., 2022;Grund et al., 2018b). Finally, we evaluated CC-MI only in conditions with MCAR or MAR data. By contrast, when data are missing not at random (MNAR), conducting MI typically requires strong assumptions about the missing data mechanism, which can be evaluated in sensitivity analyses (Carpenter & Kenward, 2013). Future research should therefore evaluate CC-MI in conditions with MNAR data.
To summarize, the present study compared and evaluated multiple approaches to CC-MI, which included both a novel extension of the popular JM and FCS approaches to MI and a number of alternative methods that extend existing methods for single-and two-level MI. We conclude that multiple approaches to CC-MI can provide an effective treatment of incomplete cross-classified data. We hope that the findings presented here motivate further research on statistical methods for handling missing values in cross-classified data and other types of nonhierarchical data.

Appendix
In this Appendix, we present the key results from our comparison of the FCS approach to CC-MI with the JM approach (for additional details, see Supplement A). We consider the bivariate case with two variables x^(1) and x^(2), where observations are clustered within two factors A and B. The joint model can be expressed as:

(x^(1), x^(2))' ~ N( (μ^(1) 1', μ^(2) 1')', T_A ⊗ J_A + T_B ⊗ J_B + T_AB ⊗ J_AB + Σ ⊗ I ) ,   (A1)

where 1 is a vector of ones, I is the identity, and J_A, J_B, and J_AB are block matrices that define the covariance structure in the cross-classified data. Specifically, if I^(p) is the identity matrix of size p and J^(p) is a square matrix of ones of size p, then J_A, J_B, and J_AB can be written in terms of Kronecker products of I^(p) and J^(p) (see Supplement A). To show that the FCS approach is equivalent to the joint model, one needs to show that (a) the conditional distribution in the FCS approach implies the same structure and (b) the parameters of the conditional distribution in the FCS approach are one-to-one functions of the parameters in the joint model. Using the steps outlined in Supplement A, the conditional mean of x^(2) given the observed values and the cluster means of x^(1) can be expressed as a linear function (Equation A3) of x^(1), the sample mean x̄^(1) of x^(1), the vector of marginal means x̄_A^(1) for each unit in A, the vector of marginal means x̄_B^(1) for each unit in B, and the vector of means x̄_AB^(1) for each pair of A and B. The conditional variance takes the form:

V( x^(2) | x^(1) ) = g_Σ I + g_A J_A + g_B J_B + g_AB J_AB + g_J J ,   (A4)

where J = J^(n n_A/B n_B/A) and the coefficients g are functions of the parameters of the joint model.

Equation A4 already indicates that the conditional distribution in the FCS approach has a covariance structure that differs from the joint model. Specifically, by conditioning on x^(1), the FCS approach induces a dependency between marginally independent observations in x^(2) (through g_J J). The coefficients are given in full in Supplement A; here, we only reproduce the results for g_J. Because g_J is not generally zero, the FCS approach is not generally equivalent to the JM approach. However, the equivalence can be shown to hold asymptotically. To this end, we write

g_J = −n f(n_B/A, n_A/B) / g(n_B/A, n_A/B) ,

where both f and g are polynomial functions of n_B/A and n_A/B. After some manipulations, it can be shown that f has a lower degree than g. Specifically, f and g can be expressed as:

f(n_B/A, n_A/B) = x_1 n_B/A n_A/B + x_2 n_B/A + x_3 n_A/B + x_4   (A5)

and

g(n_B/A, n_A/B) = z_1 n²_B/A n_A/B + z_2 n_B/A n²_A/B + z_3 n²_B/A + z_4 n_B/A n_A/B + z_5 n²_A/B + z_6 n_B/A + z_7 n_A/B + z_8 ,   (A6)

where the coefficients x_1, ..., x_4 and z_1, ..., z_8 depend on the parameters of the joint model (see Supplement A). As a result, g_J approaches zero as either n_B/A or n_A/B becomes large. In this event, the structure of the conditional distribution in the FCS approach matches the structure of the joint distribution, with coefficients that are one-to-one functions of the parameters in the joint model. Therefore, although not strictly equivalent, the FCS approach to CC-MI is still asymptotically equivalent to the JM approach in balanced fully crossed data.
Some additional results are worth noting. First, in unbalanced data, the two approaches are not equivalent (see also Resche-Rigon & White, 2018). In such a case, the coefficients for the conditional mean and variance (Equations A3 and A4) would depend not only on the parameters in the joint model but also on the cluster size (n_jk) and the number of units in A and B, which may also vary between subsets of clusters. Second, in partially crossed data, the equivalence may also not hold. For example, if the units are partially crossed in a "block-like" pattern, where each block consists of a subset of fully crossed units, then the arguments above apply to each block and the asymptotic result should still hold. However, in such cases, the cluster sizes, the number of units in A and B, and the cluster means will usually vary across blocks, which again complicates the structure of the conditional distribution in the FCS approach.

Authors' Note
All additional files concerning this article, including scripts, data, and the supplemental materials are available at https://osf.io/5em2d.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD
Simon Grund https://orcid.org/0000-0002-1290-8986

Notes

1. The Cramér's V values were computed on the basis of 100 simulated data sets from each condition and ranged from .333 for the condition with K = 64 and balanced samples to .543 for the condition with K = 256 and unbalanced samples.
2. Because we encountered convergence issues in a small number of replications, we removed replications that resulted in singular solutions or yielded extreme parameter estimates (absolute values larger than 10). In the condition that was affected the most (balanced clusters, K = 64, τ²_A = .30, MCAR), this resulted in the removal of seven (0.4%) of the 2,000 replications.