BayesLCA : An R Package for Bayesian Latent Class Analysis

The BayesLCA package for R provides tools for performing latent class analysis within a Bayesian setting. Three methods for ﬁtting the model are provided, incorporating an expectation-maximization algorithm, Gibbs sampling and a variational Bayes approximation. The article brieﬂy outlines the methodology behind each of these techniques and discusses some of the technical diﬃculties associated with them. Methods to remedy these problems are also described. Visualization methods for each of these techniques are included, as well as criteria to aid model selection.


Introduction
Populations of interest can often be divided into homogenous subgroups, although such groupings may never be explicitly observed.Commonly, it is of interest both to identify such divisions in a population of interest and succinctly describe their behavior.Latent class analysis (LCA) is a type of model-based clustering which concerns the study of binary data drawn from such populations.Examples of such data can include the correct or incorrect answers submitted during an exam (Bartholomew and Knott 1999), the symptoms presented by persons with major depressive disorder (Garrett and Zeger 2000), or a disability index recorded by long-term survey (Erosheva, Fienberg, and Joutard 2007).
While the R (R Core Team 2014) environment for statistical computing features several packages for finite mixture models, many are concerned with continuous data, such as mixtures of multivariate Gaussian distributions (mclust, Fraley and Raftery 2007), regression models (flexmix, Leisch 2004), or some combination therein (mixtools, Benaglia, Chauveau, Hunter, and Young 2009).While functions to perform LCA are available in packages such as e1071 (Dimitriadou, Hornik, Leisch, Meyer, and Weingessel 2014) and in particular poLCA (Linzer and Lewis 2011), these limit the user to performing inference within a maximum likelihood estimate, frequentist framework.No dedicated package for performing LCA within a Bayesian paradigm yet exists.
The main aims of the BayesLCA package are: Cluster observations into groups.
Perform inference on the model posterior.This may be done by obtaining maximum a posteriori (MAP) and posterior standard deviation estimates; iteratively sampling from the posterior; or by approximating the posterior with another distribution of a convenient form.
Report both parameter estimates and posterior standard deviations.
Provide plotting tools to visualize parameter behavior and assess model performance.
Provide summary tools to help assess model selection and fit.
Users of the package may also wish to include prior information into their analysis, something outside the scope of frequentist analysis.There may be multiple reasons for doing so: expert information may be available; certain boundary solutions may be viewed as unrealistic, and as such to be avoided; differing prior information may be supplied to the model in order to check the robustness of results.However, in this paper we do not dwell on this issue, instead emphasizing the methods for parameter estimation and visualization which are available in the package.
Note that while the package emphasizes inference within a Bayesian framework, inference may still be performed from a frequentist viewpoint.For example, the use of the expectation maximization (EM) algorithm, together with the specification of (proper) uniform priors for all model parameters, is the equivalent of obtaining the maximum likelihood estimate of the parameters.Conversely, some of the methods available in BayesLCA are beyond the scope of frequentist approaches.
There are three main inferential functions in BayesLCA: an EM algorithm (Dempster, Laird, and Rubin 1977), a Gibbs sampler (Geman and Geman 1984) and a variational Bayes approximation (see Ormerod and Wand 2010, for an introduction to this method).Bishop (2006) contains excellent overviews of all three techniques.While each of the methods has been shown to be highly effective, associated difficulties can occur when attempting to perform inference.For example, label switching can often occur when using a Gibbs sampler, while headlong algorithms such as the EM algorithm and variational Bayes approximation can be sensitive to the starting values with which they are specified.While remedies for these difficulties are available, some may be more effective than others and the use of all three methods in combination may provide insight into the reliability of the statistical inferences being made.It is hoped that the paper may also serve as a useful introduction for those unfamiliar with the methods.
The rest of this paper is organized as follows: model specification for LCA is briefly outlined in Section 2. Demonstrations of the main inferential functions are provided in Sections 3, 4 and 5 respectively.These three sections each follow a similar format.Firstly, a broad overview of the inferential technique is given.The method is then applied to randomly generated synthetic data in order to illustrate its application in a simple manner.Some of the technical issues associated with the method are then discussed while being applied to the Alzheimer dataset included with the package.A description of both datasets is given in Section 2.1.Additional features are discussed in Section 6, and a summary of the package, as well as potential future extensions to the package are briefly discussed in Section 7.
All computations and graphics in this paper have been done with BayesLCA version 1.6.The latest version of the package should always be available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=BayesLCA.

Latent class analysis
Latent class analysis involves performing inference within a mixture model framework, where the class of distribution is restricted to be a Bernoulli distribution.Let X = (X 1 , . . ., X N ) denote M -dimensional vector-valued random variables, composed of G sub-populations, more commonly referred to as classes or components.Two sets of parameters, the G-dimensional vector τ and G × M dimensional matrix θ, underly the model.These are referred to as the parameters for class and item probability respectively.The parameter τ g denotes the prior probability of belonging to group g, so that τ g ≥ 0 and G g=1 τ g = 1.The parameter θ gm denotes the probability, conditional on membership of group g, that . By making the naïve Bayes assumption (Hand and Yu 2001) that observations are conditionally independent based on group membership, the density of each X i can then be written as where p(X i |θ g ) = M m=1 p(X im |θ gm ).Direct inference using Equation 1 is typically difficult.The inference techniques in Sections 3, 4 and 5 are all predicated on the introduction of missing data Z = (Z 1 , . . ., Z N ).Each Z i = (Z i1 , . . ., Z iG ) is a G-dimensional vector, representing the true class membership of X i as a multinomial random variable.That is, suppose that the true group membership is known for each X i , and is denoted by The complete-data density for an observation Since Z i is not known, the posterior probability for the class membership of observation i is given by To determine the complete-data posterior distribution p(τ , θ|Z i , X i ), we first assume conjugate prior distributions p(τ |δ) and p(θ gm |α gm , β gm ), for τ and θ gm , with corresponding hyperparameters α gm , β gm , δ g ∈ R + for each g ∈ {1, . . ., G} and m ∈ {1, . . ., M }.That is, we assume that they follow Dirichlet and beta distributions respectively, which have the following forms: for each g ∈ {1, . . ., G} and m ∈ {1, . . ., M }.
Making these assumptions yields the posterior distribution (Garrett and Zeger 2000) Note that, as with all mixture models, latent class models are only identifiable up to a permutation of labels (Redner and Walker 1984;Leisch 2004).While this does not affect the inference methods discussed in Sections 3 and 5, issues do occur when employing Gibbs sampling methods, the details of which are discussed in Section 4.2.1.For convenience, once parameter estimates have been calculated in BayesLCA, labels are automatically indexed by decreasing order of size with respect to the class probability parameter τ , so that, e.g., group 3 will be the third largest group in the model.

Using BayesLCA
To begin, first load the BayesLCA package into R.
R> library("BayesLCA") While not strictly necessary, setting the seed at the start will ensure identical results to those produced in this paper.

R> set.seed(123)
First, we simulate some data X to perform inference on.This can be done using the command rlca().We generate a 500-row, 4-column matrix X with two underlying latent classes with parameters τ = 0.4 0.6 and θ = 0.8 0.8 0.2 0.2 0.2 0.2 0.8 0.8 with the code R> tau <-c(0.4,0.6) R> theta <-rbind(rep(c(0.8,0.2), each = 2), rep(c(0.2,0.8), each = 2)) R> X <-rlca(500, itemprob = theta, classprob = tau) If X is inspected, it can be seen that many rows of the data are repeated.This stands to reason, as only 2 4 = 16 unique combinations are possible.To store the data in a more efficient manner which is compatible with BayesLCA, use the command: This amounts to a list, consisting of a matrix of unique data patterns and a vector storing the number of times each pattern occurs.
We will also apply our methods to a dataset of patient symptoms recorded in the Mercer Institute of St. James' Hospital in Dublin, Ireland (Moran, Walsh, Lynch, Coen, Coakley, and Lawlor 2004;Walsh 2006).The data is a recording of the presence or absence of six symptoms displayed by 240 patients diagnosed with early onset Alzheimer's disease.Analysis of this dataset will highlight some of the technical issues associated with the inference techniques to be discussed.It is loaded into the R terminal using: R> data("Alzheimer") R> alz <-data.blca(Alzheimer)

EM algorithm
In an EM algorithm (Dempster et al. 1977), the expected complete-data log-posterior, which in this case is defined to be is iteratively maximised with respect to θ and τ , where θ (k) and τ (k) denote the values of θ and τ at iteration k (Tanner 1996).A lower bound of the observed-data log-posterior log p(θ (k) , τ (k) |X), each successive estimate of θ (k) and τ (k) with respect to Q(θ, τ |θ (k) , τ (k) ) in turn forces log p(θ, τ |X) to increase also, so that log p(θ (k+1) , τ (k+1) |X) ≥ log p(θ (k) , k) ).In this way local maxima for θ and τ are obtained (or a saddle point is reached).At the kth stage of the algorithm, parameter estimates are updated in two steps: Here Θ and T denote the sample spaces for θ and τ respectively, i.e., Θ is the The E and M steps are repeated until the algorithm is deemed to converge.
In practice, the algorithm proceeds by executing: E-step: . M-step: Note that while estimation of τ (k) necessarily includes the constraint that the values sum to 1 (Bartholomew and Knott 1999), no such constraint is imposed to obtain θ.Only the binary nature of X and suitable prior values ensure that an estimate θ gm lies in the interval [0, 1].Posterior standard deviation estimates for the estimated parameter values may be obtained from the observed information (Bartholomew and Knott 1999;Linzer and Lewis 2011), or by using bootstrapping methods.These methods will be discussed further in Section 3.2.It is also possible to determine whether the algorithm has converged to a local maximum rather than a saddle point, by checking whether all eigenvalues of the observed information matrix are positive.

Synthetic data
Let's first use the EM algorithm method to estimate θ and τ from our synthetic data X, ignoring for the moment additional model output.The standard function to call when analysing data is blca(), and then using the argument method to specify how inference is to be performed, e.g., method = "em".Alternatively, one can use blca.as a prefix, and then specify the method directly afterwards, for example blca.em().To fit a 2-class model to the data using the EM method, with any additional arguments set to their default values, simply use the command: Note that the item and membership probability estimates are close to the true values of τ and θ.We can also run the algorithm again, this time with prior values explicitly specified: Comparing the estimates for τ and θ of fit2 to fit1, we see that the results are highly similar, up to three decimal places.In this case, using a somewhat informative prior has had little impact on the result in comparison with a uniform, or non-informative, prior.

R> fit2$classprob
[1] 0.6507996 0.3492004 ,] 0.2409866 0.2234701 0.7578669 0.7630877 [2,] 0.8492732 0.8253584 0.1584750 0.1431960 Use names(fit1) to see all items returned.The blca.em function returns an object with classes "blca.em"and "blca", for which print, summary and plot methods exist.The summary command provides the prior values specified to the model, as well as other information, such as the number of iterations the algorithm ran for and the value of log-posterior found by the estimates.Multiple figures may be produced by the plot method, some of which depend on the inference method used.These are specified by the argument which.Two figures common to all blca objects illustrate parameter values and classification uncertainty respectively.The first is a type of mosaic plot, where the y-axis denotes the size of each class, and the x-axis denotes each column of the data.Each cell's colour indicates the probability of a symptom being present, with high values having a bright yellow colour and low values having dark red.The second plot is another mosaic plot, with datapoints ordered along the x-axis according to size, and colour denoting class membership.R> plot(fit1, which = 1:2) These plots are shown in Figure 1.From inspection of the figure on the right, we can see that around half the points in the dataset are clustered with high levels of certainty, while the other half are still quite well distinguished.

Alzheimer data
We now perform an EM algorithm analysis to the Alzheimer data, confining our interest to 3-class models.While expert information may be available for the data, we shall assume uniform priors for all parameters.

Local maxima
A difficulty with EM algorithms is that while the log-posterior is increased at each iteration, they may converge to only a local maximum or saddle-point.To combat this, in blca.em the algorithm is restarted multiple times from randomly chosen starting values, keeping the set of parameters achieving the highest log-posterior value.See Section 6 for a description of the different ways which these values can be generated.The argument restarts is used to determine the number of times the algorithm should be run, with default setting restarts = 5.We perform an analysis of the data with a 3-class model.The following output is given by default.The warning message will be discussed in the next section: R> sj3.em <-blca.em(alz,3) Posterior standard deviations will be 0 for these values.
From only five starts, the algorithm obtains three distinct local maxima of the log-posterior, each with a corresponding alternative set of parameter point estimates.If a sub-optimal set of estimates were incorrectly identified as obtaining the global maximum, then a very different interpretation of the dataset might be provided, potentially leading to a flawed analysis.In this case, it seems sensible to run the algorithm more times in order to identify the optimal parameters.The following code runs the algorithm for twenty times instead: R> sj31.em<-blca.em(alz,3, restarts = 20) R> plot(sort(sj31.em$lpstarts),sequence(table(round(sj31.em$lpstarts, 2))), + main = "Converged Values", xlab = "Log-Posterior", ylab = "Frequency") A point stacked histogram of the log-posterior values is shown in Figure 2.This suggests that while the algorithm converges to several distinct sets of point estimates, the estimates attained by sj31.emdo in fact appear to be the global maximum.

Posterior standard deviation estimation
So far, each model fitted using an EM algorithm has by default failed to return estimates of the posterior standard deviations.There are two ways of doing this in BayesLCA.Firstly, by using asymptotic methods.These can be obtained in two ways: setting the argument sd = TRUE when initially calling to blca.em, or, if the model has already been fit, by using the function blca.em.sd.The posterior standard deviation estimates for the model fit1 fitted to the synthetic data X is given by the following code.Note that as well as the estimates, a convergence score is returned, in this case indicating that the algorithm converged to at least a local maximum, rather than a saddle point.
R> blca.em.sd(fit1, x) However, such a method is not without drawbacks.For example, point estimates occurring on the boundary must be obmitted from the observed information, as entries for estimates which are exactly 1 or 0 are undefined.Thus models such as sj31.emreturn posterior standard deviations of zero for some parameters when using this method.Unreliable posterior standard deviation estimates can also occur for parameter estimates occurring close to the boundary (Bartholomew and Knott 1999, Chapter 6).
R> blca.em.sd(sj31.em,alz) An alternative method of posterior standard deviation estimation employs bootstrapping methods (Wasserman 2007, Chapter 3).In particular, empirical bootstrapping methods involve sampling the data with replacement and re-fitting parameters to the new data.When done multiple times, posterior standard deviation of the parameter estimates may be obtained.It would also be possible to obtain estimates using parametric bootstrapping methods, whereby new datasets are generated from the fitted parameters of the model, and bootstrap samples then obtained by re-fitting to the newly generated data.However, this method may be comparatively unstable for our purposes; for example, when generating data from a model containing a group with low probability of membership, there is a non-negligible probability that some of the generated data samples will omit the group entirely.
For each bootstrapping run, values from the originally fitted model are used as the initial starting values for the EM algorithm, which is then run on the re-sampled data.This helps the algorithm to converge comparatively quickly over most samples.The use of these starting values should also help prevent any label-switching of parameter values from occurring during the bootstrapping run; a label-switching algorithm may also be used by specifying the argument relabel = TRUE, although this comes at an additional computational cost.See Section 4.2 for further details.Note that some Posterior standard deviations are exactly 0, indicating that identical parameter estimates were obtained from all bootstrap samples.blca.boot may be applied to a previously fitted object, or be called directly, with a new model fitted to the data as an initial step in the code.There is also a plot function, which includes density estimates for the parameters for item and class probability.Figure 3 shows the density estimates for the item probability parameters fitted by sj3.boot.

Gibbs sampling
Gibbs sampling (Geman and Geman 1984) is a Markov chain Monte Carlo (MCMC) method typically used when direct sampling of the joint posterior distribution is intractable, but sampling from the full conditional distribution of each parameter is reasonably straightforward.By iteratively sampling from each conditional distribution in turn, samples of the joint posterior distribution are indirectly obtained.The method relies on the Markov assumption, in that samples drawn at iteration k + 1 depend only on parameter values sampled during the previous iteration k: k+1) , τ (k+1) , α, β, δ).
In this way, parameter samples are repeatedly drawn, until it is decided that a reasonable representation of the joint posterior distribution has been obtained.The following draws are thus made during a sample run (Garrett and Zeger 2000): for each g ∈ {1, . . ., G}, m ∈ {1, . . ., M } and i ∈ {1, . . ., N }.

Synthetic data
Using a Gibbs sampler, parameters can be fit to the synthetic data using the following code: R> fit2 <-blca.gibbs(x,2) R> plot(fit2, which = 3) Density estimates for θ are shown in Figure 4, with the true values of θ clearly lying within areas of high density.Posterior standard deviation estimates are returned as standard.

Identifiability
To begin with, run the Gibbs sampler over its default settings, with a burn-in of 100 iterations and thinning rate of 1.The relabel setting will be explained shortly: R> sj3.gibbs <-blca.gibbs(alz,3, relabel = FALSE) R> par(mfrow = c(4, 2)) R> plot(sj3.gibbs,which = 5) A diagnostic plot of sj3.gibbs is shown in Figure 5.This provides clear evidence of labelswitching, the phenomenon whereby group labels become permuted, causing multiple parameter spaces to be explored in the same run.There are many methods to deal with this problem: one approach (Stephens 2000; Celeux, Hurn, Robert, and P. 2000; Marin, Mengersen, and Robert 2005) is to employ a post-hoc re-labelling algorithm which attempts to minimise the posterior expectation of some loss function of the model parameters.Another approach is to impose an ordering constraint on parameters; however, this can lead to poor estimation (Celeux et al. 2000).
In BayesLCA a similar approach to Wyse and Friel (2012) is taken when dealing with this issue.The method re-labels samples by minimising a cost function of the group membership vector Z.Let Z (1) denote the first value of Z stored during the sample run.For any subsequent sample nh .The function matchClasses() from the e1071 package (Dimitriadou et al. 2014) is then utilized to find the permutation σ so that the diagonal entries of C are maximised.In this way the agreement between Z (1) and Z (j) is also maximised.The permuted values for θ σ and τ σ are then stored.

Burn-in, chain mixing
Two other key issues associated with Gibbs sampling are burn-in and mixing.Inspecting samples with respect to these terms helps to indicate the validity of the obtained model estimates.We again run the default model, except this time correcting for any relabelling which may occur.

R> sj30 <-blca.gibbs(alz, 3, relabel = TRUE)
Inspecting diagnostic plots of the model provides insight into how it has performed, without providing many clues as to how performance can be improved.One solution is to make use of diagnostic methods such as raftery.diag,available in the coda package (Plummer, Best, Cowles, and Vines 2006).This package is automatically loaded with BayesLCA, so there is no need to explicitly do so to make use of its functions.Objects of class "blca.gibbs"can be converted to type "mcmc" using the function as.mcmc.
R> raftery.diag(as.mcmc(sj30.gibbs)) Quantile (q) = 0.025 Accuracy (r) = +/-0.005Probability (s) = 0.95  This output suggests that the sampler converges quickly (burn-in values are low), but is not mixing satisfactorily (note the high dependence factor of many parameters).A Gibbs sampler with better tuned parameters can then be run:

Burn-in
R> sj31.gibbs<-blca.gibbs(alz,3, burn.in= 150, thin = 1/10, iter = 50000) R> par(mfrow = c(3, 2)) R> plot(sj31.gibbs,which = 3) Figure 6 shows a diagnostic plot for this model, created using the same code as for Figure 5.The behavior of the parameters in this case is much more satisfactory.Note that other functions in the coda package, such as the plot and summary methods for "mcmc" objects can also be used in this way.Density plots for θ obtained from this sampling run are shown in Figure 7.

Variational Bayes
So far methods which attempt to maximize the joint posterior p(Z, θ, τ | X, α, β, δ), or iteratively sample from its conditional distributions have been discussed, in Sections 3 and 4 respectively.Variational Bayes methods can be thought of as an approximate combination of both techniques, in that they can be used to obtain parameter estimates which maximize a fully factorized posterior approximation to the joint posterior.In the general case, it can be shown that, for some latent parameters Ω and arbitrary distribution q, a lower bound for the log-posterior log p(X) can always be obtained, via Jensen's inequality: with the discrepancy between the left and right hand sides of Equation 3 being equal to the Kullback-Leibler divergence KL(q||p) (Kullback and Leibler 1951).Supposing that Ω can be partitioned into J parameters, such that Ω = {Ω 1 , . . ., Ω J }, and restricting the family of distributions to which q(Ω) belongs to such that it can be shown that the distribution q * j which minimises KL(q||p) has the form: where the notation E i =j [log p(X, Ω)] denotes that the expectation of the log-posterior log p(X, Ω) is taken with respect to all model parameters excepting Ω j , i.e., Ω i , where i = 1, . . ., j − 1, j + 1, . . ., J.
In practice, the form of q * j (Ω j ) will be the same as that of the conditional distribution p(Ω j | X, Ω 1 , . . ., Ω j−1 , Ω j+1 , . . ., Ω J ), with the key difference that its parameters are independent of the other variational parameters in the model.Updating each parameter iteratively, à la the EM algorithm, then maximizes q(Ω), and by extension p(X, Ω).
The first step in applying this method to LCA is to introduce the variational distribution where ζ, τ and φ are all variational parameters.These have the following distributions, for each i ∈ {1, . . ., N }, g ∈ {1, . . ., G}, and m ∈ {1, . . ., M }: and the variational parameters take the following updates: where Ψ denotes the digamma function (Abramowitz and Stegun 1965).

Synthetic and Alzheimer data
The following code fits 2-and 3-class models to the synthetic data X and Alzheimer data using variational Bayes methods: This is a common feature of variational Bayes approximations, in that the enforced independence between parameters results in diminished variance estimates.Conversely, it is this restriction which allows such quick density estimation.

Model selection
It is worth mentioning two additional properties of the variational Bayes approach which distinguish it from the EM algorithm.Firstly, that saddle point convergence issues , such as those discussed in Section 3.2, which are often encountered when fitting EM algorithms, are largely avoided (Bishop and Corduneanu 2001).The second is that, if the model is over-fitted, with Dirichlet prior values δ < 1 specified to the model, redundant components are emptied out, rather than the model becoming over-fitted (Bishop 2006).One method for determining an appropriate number of classes to fit to the Alzheimer data is to deliberately over-fit the model, and then consider only the classes for which τ g > 0 (Bishop and Corduneanu 2001).This is achieved by the following lines of code: R> sj10.vb<-blca.vb(alz,10, delta = 1/10) Restart number 1, logpost = -1299.97... Warning message: In blca.vb(alz, 10, delta = 1/10) : Model may be improperly specified.Maximum number of classes that should be fitted is 9. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 20 40 60
The plotted values of the lower bound are shown in Figure 9.The multiple jumps in the lower bound indicate where components have "emptied" out.

Starting values
There are multiple ways to specify starting values in BayesLCA.This is done by specifying the start.valsoption.The argument is used to assign initial values to the group membership vector Z, either by specifying a method for how values are assigned, or by specifying the values directly.The default method is start.vals= "single", whereby each unique data point is randomly assigned membership to a single class, that is, Z i ∼ Multinomial(1, 1/G, . . ., 1/G), for each i ∈ 1, . . ., N.
Starting values may also be specified using the Zscore function.For example, when attempting to fit a 2-class model to the synthetic data X, class membership can be specified with respect to the true values of τ and θ: R> Z1 <-Zscore(x$data, classprob = tau, itemprob = theta) R> fit.true <-blca.em(x,2, start.vals= Z1) The returned output is almost identical to fit1, as fitted in Section 3.1.
Another practical use for specifying starting values would be to extend a Gibbs sampling run.By starting a run with values of Z sampled from the previous model, parameter samples may be drawn from the same region of sample space, negating the need for additional burn-in and, ideally, ensuring compatibility with the previous set of parameter samples.

Model selection
While the issue of over-fitting was discussed in the case of variational Bayes methods in Section 5.1, throughout Sections 3 and 4, the value of G, the number of underlying classes in the model, was assumed fixed and known.In this section, methods to determine the optimal number of classes to fit to a dataset using an EM algorithm or Gibbs sampling are discussed, with the Alzheimer data used as an illustrative example.Note that for latent class analysis, the number of classes which can be fit to data is at best limited by the condition G(M + 1) < 2 M , or else the model becomes unidentifiable (Goodman 1974;Dean and Raftery 2010).
While the question of identifying the appropriate number of clusters to fit to a model remains an area of ongoing research, several methods involving information criteria have been developed.Such methods are predicated on testing:  (Raftery, Newton, Satagopan, and Krivitsky 2007).In all cases, the model with the higher value with respect to a criterion is considered the better fit to the data.
We will compare these criteria for a 1, 2, and 3-class model fit to the Alzheimer data.The interested reader is invited to apply these methods to the synthetic data X and investigate whether a 2-class model is selected.

EM algorithm
Firstly, we fit 1 and 2-class models to the data.In practice, fitting a 1-class model to the data amounts to separately fitting a Beta distribution to each column of the data in the standard manner, with the local independence assumption outlined in Section 2 replaced by the assumption that the data is independently, identically distributed.That is, p(θ | X, α, β) = M m=1 p(θ m | X m , α m , β m ), where each p(θ m | X m , α m , β m ) follows a Beta( N i=1 X im + α m , N i=1 (1 − X im ) + β m ) distribution.For convenience, we fit the model the same way as the others : R> sj1 <-blca.em(alz, 1, restarts = 1) R> sj2.em <-blca.em(alz,2) Comparing the BIC and AIC of these models with that of the model sj31.emfitted in Section 3 gives the following output: R> c(sj1$BIC, sj2.em$BIC, sj31.em$BIC) [1] -1578.733 -1570.087 -1593.816R> c(sj1$AIC, sj2.em$AIC, sj31.em$AIC) [1] -1557.849 -1524.839 -1524.204The BIC indicates that the 2-class model is selected as the optimal fit, while the difference between the AIC for the 2 and 3-class model is very small.

Gibbs sampling
In the case of Gibbs sampling, we first run a 2-class model using similarly tuned parameters to sj31.gibbs from Section 4. We then compare the DIC between 1-, 2-and 3-class models, using the fact that the DIC of the 1-class model is equal to twice its log-posterior: R> sj2.gibbs <-blca.gibbs(alz,2, thin = 1/10, iter = 50000, burn.in= 150) R> c(2*sj1$logpost, sj2.gibbs$DIC, sj31.gibbs$DIC) [1] -1545.849 -1522.593 -1517.245This suggests that a 3-class fit may be best, although the difference between the 2-and 3-class models is small.Inspection of Figure 7 is also indicative of weak identifiability (Garrett and Zeger 2000), in that the posterior density of item probability parameters for the third group in the model are highly similar to their prior distributions (in this case, uniform Beta(1,1) distributions).This suggests that a 3-class model may be overfitting the data.

Discussion
In this paper we have demonstrated tools to perform LCA in a Bayesian setting using BayesLCA.The functions in this package do this by utilizing one of three methods: maximization of parameters a posteriori; sampling parameters via their conditional distributions; or by approximation of the joint posterior.While all three methods have been examined in detail, the computational cost of Gibbs sampling means that in many cases its implementation may be infeasible, particularly for data sets of a larger scale than those investigated here.Nevertheless, it remains something of a gold standard in terms of posterior estimation, and may be of benefit as a comparative tool, such as when the posterior standard deviation estimates of variational Bayes approximations were examined in Section 5. Future versions of the package may include a version of the method implemented using C code, which would substantially increase performance speed.
Currently, the package does not provide as many inference tools as poLCA: for example, it cannot be generalized to polytmous outcome variables, and cannot incorporate covariate information into a model.It is hoped to extend the package in the near future to include these features, so that a Bayesian alternative to such methods is available.A function to perform parametric bootstrapping may also be introduced, providing an alternative to the currently implemented non-parametric version.
In future versions of BayesLCA it may be of interest to include functions to perform collapsed Gibbs sampling (Nobile and Fearnside 2007).This has been successfully applied to latent block modeling (Wyse and Friel 2012), a class of models of which LCA is a subset.The method entails integrating out the item and class probability parameters τ and θ, and iteratively sampling from the posterior p(Z|X, α, β, δ).The method is primarily concerned with the clustering of data points, and parameter estimation can only be achieved via a post-hoc analysis.However, the comparative increase in speed between the collapsed and conventional sampler would be substantial.In addition the number of clusters can be included into the model as a random variable, providing an alternative method for model selection.

Figure 1 :
Figure 1: The figure on the left gives a visual indication of the underlying parameters of the model, while the figure on the right is a mosaic plot representing classification certainty.

FrequencyFigure 2 :
Figure 2: Point stacked histogram of values to which the EM algorithm runs converged.The algorithm appears to converge to several distinct sets of values.

Figure 3 :
Figure 3: Density plots for the bootstrap estimates of the item probability parameters fitted by sj3.boot.Note the spikes in the plots for the Affective, Diurnal and Aggression symptoms.This indicates that the parameter estimates in question remained unchanged over all bootstrap samples.

Figure 5 :
Figure 5: Clear evidence of label-switching taking place during the sample run.

Figure 7 :
Figure7: Posterior density plots for the item probability parameters with label-switching corrected for, and better tuned parameters.Groups are well separated for some symptoms, such as Aggression and Agitation, but not for others, notably Hallucination.Note the flatness of distributions from the third group (coloured blue), illustrating the lack of information gained by the model.

Figure 8 :
Figure 8: Approximate density estimates for θ using a variational Bayes approximation.

Figure 9 :
Figure 9: The sequence of lower bound values as the algorithm proceeds towards convergence.
Diagnostic plot of the sampling run with label-switching corrected for, and better tuned parameters.These plots make clear the high uncertainty surrounding the item probability parameters of the third group (coloured blue).