Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data

We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files with high analytic validity and low disclosure risks. Supplementary materials for this article are available online.


Introduction
In many settings, the data comprise units nested within groups (e.g., people within households), and include categorical variables measured at the unit level (e.g., individuals' demographic characteristics) and at the group level (e.g., whether the family owns or rents their home). A typical analysis goal is to estimate multivariate relationships among the categorical variables, accounting for the hierarchical structure in the data.
To estimate joint distributions with multivariate categorical data, many analysts rely on mixtures of products of multinomial distributions, also known as latent class models. These models assume that each unit is a member of an unobserved cluster, and that variables follow independent multinomial distributions within clusters. Latent class models can be estimated via maximum likelihood (Goodman, 1974) and Bayesian approaches (Ishwaran and James, 2001; Jain and Neal, 2007; Dunson and Xing, 2009). Of particular note, Dunson and Xing (2009) present a nonparametric Bayesian version of the latent class model, using a Dirichlet process mixture (DPM) for the prior distribution. The DPM prior distribution is appealing, in that (i) it has full support on the space of joint distributions for unordered categorical variables, ensuring that the model does not restrict dependence structures a priori, and (ii) it fully incorporates uncertainty about the effective number of latent classes in posterior inferences.
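To make the latent class structure concrete, the following sketch simulates data from a mixture of products of multinomials. The dimensions and probability values are illustrative only, not taken from any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 2 latent classes, 3 categorical variables
# with 2, 3, and 2 levels, respectively.
pi = np.array([0.6, 0.4])                          # class probabilities
phi = [np.array([[0.9, 0.1], [0.2, 0.8]]),         # Pr(X_k = c | class), k = 1
       np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]),
       np.array([[0.5, 0.5], [0.8, 0.2]])]

def sample_latent_class_data(n):
    """Draw n units from the mixture of products of multinomials."""
    z = rng.choice(len(pi), size=n, p=pi)          # latent class of each unit
    X = np.column_stack([
        np.array([rng.choice(p.shape[1], p=p[zi]) for zi in z])
        for p in phi])
    return z, X

z, X = sample_latent_class_data(1000)
# Within each class, variables are independent; marginally they are dependent.
```

Conditioning on the latent class renders the variables independent, which is what makes estimation tractable while still allowing arbitrary marginal dependence.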
For data nested within groups, however, standard latent class models may not offer accurate estimates of joint distributions. In particular, it may not be appropriate to treat the data for units in the same group as independent and identically distributed; for example, demographic variables like age, race, and sex of individuals in the same household are clearly dependent. Similarly, some combinations of units may be physically impossible to place in the same group, such as a daughter who is older than her biological father. Additionally, every unit in a group must have the same values of group-level variables, so that one cannot simply add multinomial kernels for the group-level variables.
In this article, we present a nonparametric Bayesian model for nested categorical data. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure encourages the model to cluster groups into data-driven types, for example, households with children where everyone has a particular race. This in turn allows for dependence among units in the same group. The nested structure also facilitates simultaneous modeling of variables at both group and unit levels. We refer to the model as the nested Dirichlet process mixture of products of multinomial distributions (NDPMPM), as we use a nested Dirichlet process (Rodriguez et al., 2008) as the prior distribution for the class membership probabilities. We present two versions of the NDPMPM: one that gives support to all configurations of groups and units, and one that assigns zero probability to groups and units with physically impossible combinations of variables (also known as structural zeros in the categorical data analysis literature).
The NDPMPM is similar to the latent class models proposed by Vermunt (2003, 2008), who also uses two layers of latent classes to model nested categorical data. These models use a fixed number of classes as determined by a model selection criterion (e.g., AIC or BIC), whereas the NDPMPM uses a nested DP to allow uncertainty in the effective number of classes at each level. To the best of our knowledge, the models of Vermunt (2003, 2008) do not include group-level random variables (they allow group-level predictors of latent class assignments, but not group-level outcomes), nor do these models account for groups with physically impossible combinations of individuals. The remainder of this article is organized as follows. In Section 2, we present the NDPMPM model for data when all configurations of groups and units are feasible. In Section 3, we present a data augmentation strategy for constructing a version of the NDPMPM that puts zero probability on impossible combinations. In Section 4, we illustrate and evaluate the NDPMPM models using household demographic data from the American Community Survey (ACS). In particular, we use posterior predictive distributions from the NDPMPM models to generate partially synthetic datasets (Rubin, 1993; Little, 1993; Reiter, 2003), and compare results of representative analyses done with the synthetic and original data. Finally, in Section 5, we conclude with discussion of implementation of the proposed models.

The NDPMPM Model
As a working example, we suppose the data include N individuals, each residing in exactly one of n < N households. For i = 1, . . . , n, let n_i > 0 equal the number of individuals in household i, so that ∑_{i=1}^{n} n_i = N.
For k = 1, . . . , p, let X_{ijk} ∈ {1, . . . , d_k} be the value of categorical variable k for person j in household i, where i = 1, . . . , n and j = 1, . . . , n_i. For k = p + 1, . . . , p + q, let X_{ik} ∈ {1, . . . , d_k} be the value of categorical variable k for household i, which is assumed to be identical for all n_i individuals in household i. For now, we assume no impossible combinations of variables within individuals or households.
We assume that each household belongs to some group-level latent class, which we label G_i, where i = 1, . . . , n. Let π_g = Pr(G_i = g) for any class g; that is, π_g is the probability that household i belongs to class g, for every household. For any k ∈ {p + 1, . . . , p + q} and any value c ∈ {1, . . . , d_k}, let λ^{(k)}_{gc} = Pr(X_{ik} = c | G_i = g) for any class g; here, λ^{(k)}_{gc} is the same value for every household in class g.
Within each household-level class, we assume that each individual belongs to some individual-level latent class, which we label M_{ij}, where i = 1, . . . , n and j = 1, . . . , n_i. Let ω_{gm} = Pr(M_{ij} = m | G_i = g) for any class pair (g, m); that is, ω_{gm} is the conditional probability that individual j in household i belongs to individual-level class m nested within group-level class g, for every individual. For any k ∈ {1, . . . , p} and any value c ∈ {1, . . . , d_k}, let φ^{(k)}_{gmc} = Pr(X_{ijk} = c | G_i = g, M_{ij} = m). We let both the q household-level variables and the p individual-level variables follow independent, class-specific multinomial distributions. Thus, the model for the data and corresponding latent classes in the truncated NDPMPM is

X_{ik} | G_i, λ ∼ Multinomial(λ^{(k)}_{G_i 1}, . . . , λ^{(k)}_{G_i d_k}) for all i and k = p + 1, . . . , p + q,   (1)
X_{ijk} | G_i, M_{ij}, φ ∼ Multinomial(φ^{(k)}_{G_i M_{ij} 1}, . . . , φ^{(k)}_{G_i M_{ij} d_k}) for all i, j, and k = 1, . . . , p,   (2)
G_i | π ∼ Multinomial(π_1, . . . , π_F) for all i,   (3)
M_{ij} | G_i, ω ∼ Multinomial(ω_{G_i 1}, . . . , ω_{G_i S}) for all i and j,   (4)

where each multinomial distribution has sample size equal to one and number of levels implied by the dimension of the corresponding probability vector, and (F, S) are the truncation levels for the household-level and individual-level classes. Here, we allow the multinomial probabilities for individual-level classes to differ by household-level class. One could impose additional structure on the probabilities, for example, force them to be equal across classes as suggested in Vermunt (2003, 2008); we do not pursue such generalizations here. As prior distributions on π and ω, we use the truncated stick-breaking representation of the Dirichlet process (Sethuraman, 1994), following the approach described in Rodriguez et al. (2008). We have

π_g = u_g ∏_{f<g} (1 − u_f) for g = 1, . . . , F,   (5)
u_g ∼ Beta(1, α) for g = 1, . . . , F − 1, with u_F = 1,   (6)
ω_{gm} = v_{gm} ∏_{s<m} (1 − v_{gs}) for m = 1, . . . , S,   (7)
v_{gm} ∼ Beta(1, β_g) for m = 1, . . . , S − 1, with v_{gS} = 1,   (8)
α ∼ Gamma(a_α, b_α),   (9)
β_g ∼ Gamma(a_β, b_β) for g = 1, . . . , F.   (10)

As prior distributions on λ and φ, we use independent Dirichlet distributions,

λ^{(k)}_g ∼ Dirichlet(a_{k1}, . . . , a_{kd_k}),   (11)
φ^{(k)}_{gm} ∼ Dirichlet(a_{k1}, . . . , a_{kd_k}).   (12)

We set a_{k1} = · · · = a_{kd_k} = 1 for all k to correspond to uniform distributions. Following Dunson and Xing (2009) and Si and Reiter (2013), we set (a_α = .25, b_α = .25) and (a_β = .25, b_β = .25), which represents a small prior sample size and hence a vague specification for the Gamma distributions. We estimate the posterior distribution of all parameters using a blocked Gibbs sampler (Ishwaran and James, 2001; Si and Reiter, 2013); see Appendix A for the relevant full conditionals.
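As an illustration of the NDPMPM's generative process, the following Python sketch simulates one household. The truncation levels, variable dimensions, and randomly drawn parameter values are assumed for illustration; they are not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy dimensions: F household-level classes, S individual-level
# classes, p individual-level variables, q household-level variables.
F, S = 3, 4
p, q = 2, 1
d = [2, 3, 2]                                      # levels of variables k = 1,...,p+q

pi = rng.dirichlet(np.ones(F))                     # household-class probabilities
omega = rng.dirichlet(np.ones(S), size=F)          # omega[g, m] = Pr(M_ij = m | G_i = g)
lam = [rng.dirichlet(np.ones(d[k]), size=F) for k in range(p, p + q)]
phi = [rng.dirichlet(np.ones(d[k]), size=(F, S)) for k in range(p)]

def sample_household(h):
    """Generate one household with h individuals from the NDPMPM."""
    g = rng.choice(F, p=pi)                                     # household class
    house_vars = [rng.choice(d[p + k], p=lam[k][g]) for k in range(q)]
    members = []
    for _ in range(h):
        m = rng.choice(S, p=omega[g])                           # nested class
        members.append([rng.choice(d[k], p=phi[k][g, m]) for k in range(p)])
    return g, house_vars, members

g, hv, mem = sample_household(3)
```

Note that every member of the household shares the same draw of the household-level class g and the household-level variables, while individual-level classes vary across members within g.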
Intuitively, the NDPMPM seeks to cluster households with similar compositions; within the pool of individuals in any household-level class, the model then seeks to cluster individuals with similar characteristics. Because individual-level latent class assignments are conditional on household-level latent class assignments, the model induces dependence among individuals in the same household (more accurately, among individuals in the same household-level cluster). To see this mathematically, consider the joint distribution of variable k for two individuals j and j' in the same household,

Pr(X_{ijk} = c, X_{ij'k} = c') = ∑_{g=1}^{F} π_g ( ∑_{m=1}^{S} ω_{gm} φ^{(k)}_{gmc} ) ( ∑_{m=1}^{S} ω_{gm} φ^{(k)}_{gmc'} ),   (13)

which does not factor into the product of the two individuals' marginal probabilities unless the within-class probabilities are constant across g. We also note that, as argued in Rodriguez et al. (2008), the nested DP prior distribution a priori induces stronger associations among individuals in the same group than among individuals in different groups.
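A small numerical check of this induced dependence, using assumed toy parameters for a single binary variable:

```python
import numpy as np

# Toy parameters (assumed for illustration): two household-level classes,
# two individual-level classes, one binary variable, with phi[g, m]
# denoting Pr(X = 1 | G = g, M = m).
pi = np.array([0.5, 0.5])
omega = np.array([[0.9, 0.1], [0.1, 0.9]])
phi = np.array([[0.9, 0.2], [0.3, 0.8]])

# Pr(X = 1 | G = g): average phi over the individual-level classes.
p1_given_g = (omega * phi).sum(axis=1)        # [0.83, 0.75]

# Two members of the same household are independent given G, so the joint
# probability averages the squared conditional probability over G.
joint = float((pi * p1_given_g**2).sum())     # Pr(X_j = 1, X_j' = 1)
marginal = float((pi * p1_given_g).sum())     # Pr(X_j = 1)

# joint exceeds marginal**2 whenever Pr(X = 1 | G = g) varies across g,
# i.e., the model induces positive within-household dependence.
```

Here joint = 0.6257 while marginal² = 0.6241, so the two household members' values are dependent even though they are conditionally independent given the household-level class.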
For values of F and S, we recommend that analysts start with modest values, say F = 10 and S = 10, to favor expedient computations. After convergence of the MCMC chain, analysts should check how many latent classes at the household level and individual level are occupied across the MCMC iterations. When the number of occupied household-level classes hits F, analysts should increase F.
When this is not the case but the number of occupied individual-level classes hits S, analysts can try increasing F alone, as the increased number of household-level latent classes may capture enough heterogeneity across households to make S adequate. When increasing F does not help, for example because there are too many different types of individuals, analysts can increase S, possibly in addition to F. We emphasize that these types of titrations are useful primarily to reduce computation time; analysts always can increase S and F simultaneously so that they exceed the numbers of occupied classes in initial runs.
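In practice, this check amounts to counting the distinct occupied class labels at each saved iteration. A sketch with hypothetical stored assignments (the array names and dimensions are illustrative):

```python
import numpy as np

def occupied_counts(G_draws, M_draws):
    """Count occupied household- and individual-level classes per MCMC draw.

    G_draws: (T, n) array of household-class assignments over T iterations.
    M_draws: (T, N) array of individual-class assignments over T iterations.
    """
    occ_F = np.array([len(np.unique(g)) for g in G_draws])
    occ_S = np.array([len(np.unique(m)) for m in M_draws])
    return occ_F, occ_S

# Hypothetical draws with truncation levels F = 10 and S = 10.
rng = np.random.default_rng(2)
G_draws = rng.choice(10, size=(100, 50))
M_draws = rng.choice(10, size=(100, 120))
occ_F, occ_S = occupied_counts(G_draws, M_draws)

# If occ_F.max() ever reaches the truncation level F, increase F and rerun;
# apply the analogous check to occ_S for the individual level.
needs_larger_F = bool(occ_F.max() >= 10)
```

This diagnostic is cheap relative to refitting, so it is worth saving the class assignments across iterations for exactly this purpose.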
It can be computationally convenient to set β_g = β for all g in (10), as doing so reduces the number of parameters in the model. Allowing the β_g to be class-specific offers additional flexibility, as the prior distribution of the individual-level class probabilities can vary by household-level class. In our evaluations of the model on the ACS data, using a common or distinct value had only minor impact on the results.

Adapting the NDPMPM for Impossible Combinations
The models in Section 2 make no restrictions on the compositions of groups or individuals. However, in many contexts this is unrealistic. Using our working example, suppose that the data include a variable that characterizes relationships among individuals in the household, as does the ACS. Levels of this variable include household head, spouse of household head, parent of the household head, etc. By definition, each household must contain exactly one household head. Additionally, by definition (in the ACS), each household head must be at least 15 years old. Thus, we require a version of the NDPMPM that enforces zero probability for any household that has zero or multiple household heads, and any household headed by someone younger than 15.
We need to modify the likelihoods in (1) and (2) to enforce zero probability for impossible combinations; equivalently, we need to truncate the support of the NDPMPM. To express this mathematically, let C represent all combinations of individuals and households, including any impossible combinations.
For any household with h individuals, let S_h ⊂ C be the set of combinations that should have zero probability, i.e., the set of impossible combinations for households of that size.
Let H be the set of all household sizes in the observed data. We define a random variable for all the data for person j in household i as X*_{ij} = (X_{ij1}, . . . , X_{ijp}, X_{i(p+1)}, . . . , X_{i(p+q)}), and the random variable for all data in household i as X*_i = (X*_{i1}, . . . , X*_{in_i}). Here, we write a superscript * to indicate that the random variables have support only on C − S, where S = {S_h : h ∈ H}; in contrast, we use X_{ij} and X_i to indicate the corresponding random variables with unrestricted support on C. Letting X* = (X*_1, . . . , X*_n) be a sample comprising data from n households, the likelihood component of the truncated NDPMPM model can be written as

p(X* | θ) = ∏_{i=1}^{n} Pr(X_i = X*_i | θ) 1{X*_i ∈ C − S_{n_i}} / Pr(X_i ∉ S_{n_i} | θ),   (14)

where θ includes all parameters of the model described in Section 2, and 1{·} equals one when the condition inside the braces is true and equals zero otherwise.
For all h ∈ H, let n*_h = ∑_{i=1}^{n} 1{n_i = h} be the number of households of size h in X*. From (14), we seek to compute the posterior distribution

p(θ | X*, {n*_h : h ∈ H}) ∝ p(θ) ∏_{h ∈ H} (1 − π_{0h}(θ))^{−n*_h} ∏_{i=1}^{n} Pr(X_i = X*_i | θ).   (15)

Here, π_{0h}(θ) = Pr(X_i ∈ S_h | θ), where again X_i is the random variable with unrestricted support.
The Gibbs sampling strategy from Section 2 requires conditional independence across individuals and variables, and hence unfortunately is not appropriate as a means to estimate (15). Instead, we follow the general approach of Manrique-Vallier and Reiter (2014). The basic idea is to treat the observed data X*, which we assume includes only feasible households and individuals (e.g., there are no errors that create impossible combinations in the observed data), as a sample from an augmented dataset X = (X*, X_0) of unknown size. We assume X arises from a NDPMPM model that does not restrict the characteristics of households or individuals; that is, all combinations of households and individuals are allowable in the augmented sample. With this conceptualization, we can construct a Gibbs sampler that appropriately assigns zero probability to combinations in S and results in draws of θ from (15). Given a draw of X, we draw θ from the NDPMPM model as in Section 2, treating X as if it were collected data. Given a draw of θ, we draw X_0 using a rejection sampling scheme. For each stratum h ∈ H defined by the unique household sizes in X*, we repeatedly simulate households with individuals from the untruncated NDPMPM model, stopping when the number of simulated feasible households matches n*_h. We then let X_0 comprise the generated households that fell in S.
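The rejection step for one stratum can be sketched as follows. The household generator and the constraint check below are simplified stand-ins (a single relationship code with an "exactly one household head" rule), not the paper's full set of structural zeros:

```python
import numpy as np

rng = np.random.default_rng(3)

def violates_constraints(household):
    """Stand-in for membership in S_h: a feasible household has exactly
    one head (relationship code 0). A real implementation would encode
    all structural zeros for households of this size."""
    heads = sum(1 for person in household if person["rel"] == 0)
    return heads != 1

def sample_household(h):
    """Stand-in for a draw from the untruncated NDPMPM: h individuals
    with uniformly random relationship codes in {0, 1, 2}."""
    return [{"rel": int(rng.choice(3))} for _ in range(h)]

def augment_stratum(h, n_star_h):
    """Simulate size-h households until n_star_h feasible ones appear;
    return the infeasible draws, which form the size-h slice of X_0."""
    feasible, rejected = 0, []
    while feasible < n_star_h:
        hh = sample_household(h)
        if violates_constraints(hh):
            rejected.append(hh)
        else:
            feasible += 1
    return rejected

X0_h = augment_stratum(h=2, n_star_h=100)
```

The number of rejected households per stratum is random; it grows when the fitted model places substantial mass on the infeasible set S_h, which is why the size of X_0 can become large in practice.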
We now state a result, proved in Appendix B, that ensures draws of θ from this rejection scheme correspond to draws from (15). First, we note that each record in X is associated with a household-level class assignment and individual-level class assignments. Let (G*, M*) be the assignments for the cases in X*, and let (G_0, M_0) be the assignments corresponding to the n_0 cases in X_0. Let n_{0h} be a random variable corresponding to the number of households of size h generated in X_0. If we set p(n_{0h} | n*_h) ∝ 1/(n*_h + n_{0h}) for all h ∈ H, we can show that (with a slight abuse of notation)

p(θ | X*, {n*_h : h ∈ H}) = ∫ p(θ, G*, G_0, M*, M_0, X_0, {n_{0h} : h ∈ H} | X*, {n*_h : h ∈ H}) dG* dG_0 dM* dM_0 dX_0 d{n_{0h}}.   (16)

The proof is similar to the one provided by Manrique-Vallier and Reiter (2014).

Synthetic data must have acceptably low risks of disclosure before they can be shared. Thus, we also evaluate a measure of disclosure risk; that is, we quantify the probabilities that intruders can learn the values of original data records given the released synthetic data. To save space, we briefly summarize results of the disclosure analysis here; full details are in the online supplement.
Specifically, we generate L synthetic datasets, Z = (Z^{(1)}, . . . , Z^{(L)}), by sampling L datasets from the posterior predictive distribution of the NDPMPM model. We generate synthetic data so that the number of households of any size h in each Z^{(l)} exactly matches n*_h. This improves the quality of the synthetic data by ensuring that the total number of individuals and the household size distributions match in Z and X*.
As a result, Z can be viewed as partially synthetic data, even though every released Z_{ijk} is a simulated value.
To make inferences with Z, we use the approach in Reiter (2003). Suppose that we seek to estimate some scalar population quantity Q. For l = 1, . . . , L, let q^{(l)} and u^{(l)} be, respectively, the point estimate of Q and its associated variance estimate computed with Z^{(l)}. Let q̄_L = ∑_l q^{(l)}/L, ū_L = ∑_l u^{(l)}/L, b_L = ∑_l (q^{(l)} − q̄_L)²/(L − 1), and T_L = ū_L + b_L/L. We make inferences about Q using the t-distribution, (q̄_L − Q) ∼ t_{ν_L}(0, T_L), with degrees of freedom ν_L = (L − 1)(1 + ū_L/(b_L/L))².
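These combining rules are straightforward to implement; a minimal sketch:

```python
def synthesis_inference(q, u):
    """Combine estimates from L synthetic datasets via the partial-synthesis
    rules of Reiter (2003): returns the point estimate q_bar, the variance
    estimate T, and the degrees of freedom of the reference t-distribution."""
    L = len(q)
    q_bar = sum(q) / L                                  # average point estimate
    u_bar = sum(u) / L                                  # within-dataset variance
    b = sum((qi - q_bar) ** 2 for qi in q) / (L - 1)    # between-dataset variance
    T = u_bar + b / L                                   # total variance
    nu = (L - 1) * (1 + u_bar / (b / L)) ** 2           # degrees of freedom
    return q_bar, T, nu

# Hypothetical estimates from L = 5 synthetic datasets.
q_bar, T, nu = synthesis_inference([0.52, 0.50, 0.49, 0.51, 0.53],
                                   [0.001] * 5)
```

A 95% interval for Q is then q̄_L ± t_{ν_L, 0.975} √T_L; when ν_L is large, the normal quantile 1.96 is an adequate substitute.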

Application without structural zeros
We use data comprising n = 10000 households and N = 18782 individuals randomly sampled from the 2012 ACS public use file. For illustration, we disregard the stratification by census tract and act as if the data are a simple random sample. We discuss approaches to incorporating the stratification in Section 5. We use the three household-level variables and ten individual-level variables summarized in Table 1.
Household sizes range from one to eight, with (n*_1, . . . , n*_8) = (3118, 5398, 1149, 275, 45, 10, 4, 1). To create each synthetic dataset, for each household we sample a household-level class and then the household-level variables from (1); for each individual, we sample an individual-level class W_{ij} and then the individual-level values (X_{ij1}, . . . , X_{ijp}) from (2) using W_{ij} and the corresponding probabilities in φ. We repeat this process L = 5 times, using approximately independent draws of parameters obtained from iterations that are far apart in the MCMC chain.
To check the utility of the synthetic datasets, we compare the relationships among the variables across the original dataset and the 5 synthetic datasets. In particular, we look at the marginal distributions of all variables, the bivariate distributions of all possible pairs of variables, and the trivariate distributions of all possible triplets of variables. We plot the probabilities averaged over the 5 synthetic datasets against the corresponding probabilities in the original dataset. Figures 1, 2, and 3 show that the points fall close to the y = x line, indicating that the synthetic datasets preserve the relationships among the variables well. For the bivariate and trivariate distributions, we restrict attention to cells with at least 10 counts.
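The comparison of marginal probabilities can be computed as follows; the data below are randomly generated placeholders, not the ACS sample:

```python
import numpy as np

def marginal_probs(X, k, levels):
    """Empirical marginal distribution of variable k (coded 0,...,levels-1)."""
    counts = np.bincount(X[:, k], minlength=levels)
    return counts / counts.sum()

def compare_marginals(original, synthetic_list, k, levels):
    """Probabilities in the original data vs. the average over synthetic
    datasets; points near the y = x line indicate good utility."""
    p_orig = marginal_probs(original, k, levels)
    p_syn = np.mean([marginal_probs(Z, k, levels) for Z in synthetic_list],
                    axis=0)
    return p_orig, p_syn

# Placeholder original data and L = 5 placeholder synthetic datasets.
rng = np.random.default_rng(4)
original = rng.choice(3, size=(1000, 2))
synthetic_list = [rng.choice(3, size=(1000, 2)) for _ in range(5)]
p_orig, p_syn = compare_marginals(original, synthetic_list, k=0, levels=3)
```

Bivariate and trivariate versions follow the same pattern, with cell probabilities computed over cross-classifications of two or three variables and small cells (fewer than 10 counts) excluded.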

Application with structural zeros
In this scenario, we use data comprising n = 10000 households and N = 26732 individuals randomly sampled from the 2011 ACS public use file. We again disregard stratification issues. Here, we select variables to mimic those on the decennial census, including a variable indicating relationships among individuals within the same household. This creates numerous and complex patterns of impossible combinations. We restrict the sample to households with two, three, or four individuals. We exclude households with only one individual because such individuals by definition must be classified as household heads, so that we have no need to model the family relationship variable. To generate synthetic data for households of size one, we can use non-nested versions of latent class models (Dunson and Xing, 2009; Manrique-Vallier and Reiter, 2014). We exclude households with n_i > 4 for presentational and computational convenience. We discuss issues related to computation further in Section 5. We use one household-level variable and ten individual-level variables, summarized in Table 3. Household sizes take on the values (n*_2, n*_3, n*_4) = (5398, 2481, 2121).
We run the MCMC sampler for the NDPMPM model of Section 3 for 10000 iterations, setting F = 30 and S = 10. To check for convergence of the MCMC chain, we examine trace plots of π, α, and β. The posterior mean of the number of household-level classes occupied by households in the original data X* is 23.65, with a 95% central interval of (18, 29). Within household-level classes, the posterior means of the numbers of individual-level classes occupied by individuals in the original data range from 8 to 10.
The posterior mean of the number of households generated in X_0 is 930870, with 95% credible interval (467730, 1306100). The number of augmented records keeps increasing over the iterations and does not seem to stabilize.
Generating 10000 iterations took 5921 minutes on a standard desktop computer with an implementation in R.
As a byproduct of the augmented data approach in the MCMC sampler, at each MCMC iteration we create n households that satisfy all constraints. We use the households from a randomly sampled iteration (after convergence) to form Z^{(l)}. We take the constraint-satisfying households from L = 5 iterations to form Z, using iterations that are far apart in the MCMC chain.
Similarly to the simulation in Section 4.1, we check the utility of the synthetic datasets by comparing the marginal, bivariate, and trivariate distributions of the variables. As shown in Figures 4, 5, and 6, the proposed truncated model preserves the relationships among the variables well. As before, we restrict attention to cells with at least 10 counts for the bivariate and trivariate distributions. Table 4 shows that the model performs well on these measures. We also examine the percentages of certain types of households in the original and synthetic datasets; the model captures these complex within-household relationships well. Table 4: First 3 rows: within-household race for households of size 2, 3, and 4, with the number of households of each size in parentheses (the confidence intervals refer to the percentages of households in which everyone has the same race); last 15 rows: percentages of certain types of households.

Discussion
The simulation in Section 4.2 excludes households with n_i > 4 for presentational and computational convenience. For this random sample of the 2011 ACS public use file, it is a fairly challenging task to write down all the constraints, especially as the household size increases. In our simulation study, there are 11 constraints for households of size 2, 65 constraints for households of size 3, and 273 constraints for households of size 4. The larger the household size, the more constraints need to be specified for the sampling scheme. Computation is also challenging, as the number of generated impossible households increases along the MCMC chain, so that each iteration takes longer to complete. Parallel computing is a good solution here, as household generation and constraint checking can be parallelized easily. Overall, we view our simulation with households of size 2, 3, and 4 as a sufficient demonstration of the proposed truncated model for nested categorical data with impossible combinations. To apply this methodology in practice, a more systematic way of specifying and checking the constraints, along with the use of parallel computing, would be necessary.
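For example, constraint checking parallelizes naturally across batches of simulated households. A minimal sketch using Python threads, where the household records (lists of relationship codes) and the single-head rule are illustrative stand-ins for the full constraint set:

```python
from concurrent.futures import ThreadPoolExecutor

def count_constraint_violations(households):
    """Stand-in constraint check: a feasible household has exactly one
    head (relationship code 0). Real checks would encode all structural
    zeros for the relevant household size."""
    return sum(1 for hh in households if sum(p == 0 for p in hh) != 1)

# Hypothetical batches of simulated households (relationship codes only).
batches = [[[0, 1], [0, 0]],
           [[0, 2], [1, 2]],
           [[0, 1, 2], [1, 1, 1]]]

# Checking is embarrassingly parallel: each batch is scored independently.
with ThreadPoolExecutor(max_workers=3) as ex:
    violations = list(ex.map(count_constraint_violations, batches))

total_violations = sum(violations)
```

For CPU-bound constraint checks, process-based pools (or a compiled implementation) would give larger speedups than threads; the point here is only the stratified, independent structure of the work.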
As mentioned before, in our simulation study we notice that the number of generated impossible households increases along the MCMC chain and does not seem to converge after 10000 iterations. When we compare the utility of synthetic datasets generated early in the chain with that of datasets generated at the end of the chain, we see no big improvement from running the simulation longer. Therefore, we recommend putting an upper bound on the number of households to be generated, for computational convenience. The value of the upper bound is an empirical decision, however, and we recommend that users check and compare the results at different iterations before making the decision.
Looking to future research, we see two big steps for this nested model.

Appendix A: Full conditional distributions in posterior computation

Step 1: For i = 1, . . . , n, sample G_i ∈ {1, . . . , F} from a multinomial distribution with sample size one and probabilities

Pr(G_i = g | ·) ∝ π_g ∏_{k=p+1}^{p+q} λ^{(k)}_{g X_{ik}} ∏_{j=1}^{n_i} ( ∑_{m=1}^{S} ω_{gm} ∏_{k=1}^{p} φ^{(k)}_{g m X_{ijk}} ).

Step 2: For each i = 1, . . . , n and j = 1, . . . , n_i, sample M_{ij} ∈ {1, . . . , S} given G_i from a multinomial distribution with sample size one and probabilities

Pr(M_{ij} = m | G_i = g, ·) ∝ ω_{gm} ∏_{k=1}^{p} φ^{(k)}_{g m X_{ijk}}.

Step 3: For g = 1, . . . , F − 1, sample u_g from the Beta distribution,

u_g | · ∼ Beta(1 + mm_g, α + ∑_{f=g+1}^{F} mm_f),

where mm_g = ∑_{i=1}^{n} 1(G_i = g). Set π_g = u_g ∏_{f<g} (1 − u_f), with u_F = 1.

Step 4: For g = 1, . . . , F and m = 1, . . . , S − 1, sample v_{gm} from the Beta distribution,

v_{gm} | · ∼ Beta(1 + ss_{gm}, β_g + ∑_{s=m+1}^{S} ss_{gs}),

where ss_{gm} = ∑_{i=1}^{n} ∑_{j=1}^{n_i} 1(G_i = g, M_{ij} = m). Set ω_{gm} = v_{gm} ∏_{s<m} (1 − v_{gs}), with v_{gS} = 1.

Step 5: For g = 1, . . . , F and k = p + 1, . . . , p + q, sample a new value of λ^{(k)}_g from the Dirichlet distribution

λ^{(k)}_g | · ∼ Dirichlet(a_{k1} + ∑_{i: G_i = g} 1(X_{ik} = 1), . . . , a_{kd_k} + ∑_{i: G_i = g} 1(X_{ik} = d_k)).

Step 6: For g = 1, . . . , F, m = 1, . . . , S, and k = 1, . . . , p, sample a new value of φ^{(k)}_{gm} from the Dirichlet distribution

φ^{(k)}_{gm} | · ∼ Dirichlet(a_{k1} + ∑_{(i,j): G_i = g, M_{ij} = m} 1(X_{ijk} = 1), . . . , a_{kd_k} + ∑_{(i,j): G_i = g, M_{ij} = m} 1(X_{ijk} = d_k)).

Step 7: Sample a new value of α from the Gamma distribution

α | · ∼ Gamma(a_α + F − 1, b_α − ∑_{g=1}^{F−1} log(1 − u_g)).

Step 8: Sample a new value of β from the Gamma distribution

β | · ∼ Gamma(a_β + F(S − 1), b_β − ∑_{g=1}^{F} ∑_{m=1}^{S−1} log(1 − v_{gm})).

Appendix B: Data Augmentation Strategy and Proof
Procedures for generating households and individuals

The detailed procedure for generating a sample of households and individuals to obtain an augmented dataset X = (X*, X_0) is described in the following. Since the algorithm is stratified by household size, X = {(X*_h, X_{0h}), h ∈ H}. Given the previously drawn X, we draw θ = {π, ω, λ, φ} from the posterior distributions under the NDPMPM model, as described in Steps 3 to 6 in Appendix A. To generate the i-th household of size n_i = h:

Step 1: Generate the household-level latent class assignment G_i = g from G_i | π ∼ Multinomial(π_1, . . . , π_F).

Step 2: Given the sampled G_i = g, generate all the household-level variables k = p + 1, . . . , p + q from X_{ik} | G_i = g, λ ∼ Multinomial(λ^{(k)}_{g1}, . . . , λ^{(k)}_{gd_k}).

Step 3: For each of the j = 1, . . . , h household members, generate the individual-level latent class assignment M_{ij} = m given the sampled G_i = g from M_{ij} | G_i = g, ω ∼ Multinomial(ω_{g1}, . . . , ω_{gS}).

Step 4: Given the sampled G_i = g and M_{ij} = m, generate all the individual-level variables k = 1, . . . , p, for j = 1, . . . , h, from X_{ijk} | G_i = g, M_{ij} = m, φ ∼ Multinomial(φ^{(k)}_{gm1}, . . . , φ^{(k)}_{gmd_k}).

Step 5: After obtaining this household record x_i, check whether x_i ∈ S_h. If yes, include the record in the set X_{0h}; if no, increase the count of simulated feasible households of size h by one.

Repeat Steps 1 to 5 until the number of simulated feasible households of size h equals n*_h, the corresponding number in X*_h. Do this for all h ∈ H to obtain an augmented dataset X = {(X*_h, X_{0h}), h ∈ H}.
Proof of result (16) in Section 3

Notation follows Section 3. To obtain samples from p(θ | X*, {n*_h}) in the truncated latent structure model from a sampler for p(θ, G*, G_0, M*, M_0, X_0, {n_{0h} : h ∈ H} | X*, {n*_h : h ∈ H}) under the untruncated model, we need to show that

p(θ | X*, {n*_h}) = ∫ p(θ, G*, G_0, M*, M_0, X_0, {n_{0h} : h ∈ H} | X*, {n*_h : h ∈ H}) dG* dG_0 dM* dM_0 dX_0 d{n_{0h}}.   (17)

By Bayes' rule, the joint distribution on the right-hand side factors as

p(θ, G*, G_0, M*, M_0, X_0, {n_{0h}} | X*, {n*_h}) ∝ p(θ) p({n_{0h}} | {n*_h}) f(X*, X_0 | G*, G_0, M*, M_0, θ) f(G*, G_0, M*, M_0 | θ),   (18)

where f denotes the generic pmf of the likelihood and of the latent class assignments, as appropriate, and G_h and M_h denote the latent class assignments associated with households of size h.
Therefore, Equation (18) can be expressed based on the size stratification as follows: