Linear Logistic Scoring Equations for Latent Class and Latent Profile Models: A Simple Method for Classifying New Cases

Researchers are often interested in using latent class or latent profile parameter estimates to obtain posterior class membership probabilities for observations other than those of the original sample. In this paper, we demonstrate that these probabilities typically take on the form of linear logistic equations with coefficients which are functions of the original model parameters. In other words, the posterior class membership probabilities can be computed with a prediction formula similar to that of a multinomial logistic regression model. We derive the scoring equations for nominal, ordinal, count, and continuous indicators, as well as investigate models with missing values on class indicators, local dependencies, covariates, or multiple latent variables. In addition to the mathematical derivations of the scoring equations, we describe how either exact or approximate scoring equations can be obtained by estimating a multinomial regression model using a weighted data set.

In applications of factor analysis, after selecting and estimating the factor model of interest, one will typically obtain (linear) factor-score equations which can be used to estimate the subjects' factor scores as a function of the original items included in the model (Bartholomew, Knott, and Moustaki, 2011). An important feature of factor-score equations is that these can be used not only for the subjects in the estimation sample, but also for new subjects, that is, for out-of-sample prediction.
When performing a latent class (LC) analysis, after selecting the final model, one may assign the individuals in the estimation sample to LCs using their posterior class membership probabilities. However, it is not well known that these posterior probabilities can be expressed exactly by a set of linear logistic equations, with "regression" weights which are functions of the original LC model parameters. More specifically, a closed-form expression for the posteriors exists, as a function of the LC model parameters, if the responses are modeled using distributions from the exponential family and with canonical link functions. Availability of a set of scoring equations makes it straightforward to compute the class membership probabilities for subjects who do not belong to the original sample used to estimate the LC model. In this way, one can realize an important goal of many LC analysis applications, namely obtaining out-of-sample class membership predictions. The main advantage of this approach is that it allows predicting class memberships of new subjects without the need to use LC analysis software or to program the formula of the estimated LC model. Note that this is similar to what is done in factor analysis, where factor scores are obtained using a linear factor-score formula, without the need to return to the estimates of the factor covariances, factor loadings, and residual covariances.
As far as we know, LatentGOLD (Vermunt and Magidson, 2016, 2021) is currently the only software for LC analysis that allows one to obtain these logistic scoring equations, both in tabular form and in the form of SPSS or R syntax. The aim of this paper is to show how these equations are derived. As will be shown, the slopes of the linear logistic scoring equations are obtained easily, but the expression for the intercept terms (constants) may be somewhat more complex. In many situations, the equations for the posteriors will contain only main effects of the response variables. However, as in quadratic discriminant analysis, when a LC model for continuous responses assumes variances to be class specific, quadratic terms also need to be included, and when the LC model contains covariances/associations which are class specific, interactions are also required. The approach can be extended easily to LC models with covariates and multiple latent variables. More complicated are situations where responses contain missing values (in which case the constants need to be adapted to the missing data pattern), where the model contains direct effects of covariates on the responses (in which case the exact logistic form may no longer hold), and where non-canonical link functions are used (in which case there is no longer any direct relation between the LC model parameters and the scoring equation).
Rather than computing the scoring equations from the LC model parameters, one can also obtain these equations by estimating a multinomial logistic regression model using the posteriors as weights, as done in the LatentGOLD Step3-Scoring option (Vermunt and Magidson, 2016, 2021). This approach has the advantage of increased flexibility in that it is also possible to obtain approximate equations when exact closed-form solutions are not available, or when one prefers a simpler approximate set of scoring equations over more complex exact equations.
Below, we present the scoring equations for models for categorical responses (nominal, ordinal, and counts), models for continuous responses, models with local dependencies, models with covariates, models with missing values on responses, and models with multiple latent variables. We also discuss how the scoring equations can be obtained using a weighted multinomial logistic regression analysis.

Latent Class Models for Categorical Responses
Let $Y_j$ denote one of $J$ response variables (or indicators), with $1 \le j \le J$. A particular response for, and the number of categories of, the $j$th response variable are referred to as $r_j$ and $R_j$, respectively, with $1 \le r_j \le R_j$. The probability of having a particular set of responses $\mathbf{r}$ is denoted by $P(\mathbf{Y} = \mathbf{r})$. The discrete latent variable is denoted by $X$, a particular latent class by $k$, and the number of classes by $K$.

Nominal Responses
The standard LC model for nominal responses has the following form (Collins and Lanza, 2010; Goodman, 1974a/b; Hagenaars, 1990; McCutcheon, 1987):

$$P(\mathbf{Y} = \mathbf{r}) = \sum_{k=1}^{K} P(X = k) \prod_{j=1}^{J} P(Y_j = r_j \mid X = k),$$

where $P(X = k)$ is the probability of belonging to class $k$ and $P(Y_j = r_j \mid X = k)$ the conditional probability of giving response $r_j$ on variable $Y_j$ conditional on belonging to class $k$.
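Replacing the model probabilities by their logit equations yields:

$$P(X = k) = \frac{\exp(\gamma_k)}{\sum_{k'=1}^{K} \exp(\gamma_{k'})}, \qquad P(Y_j = r_j \mid X = k) = \frac{\exp(\beta^{j}_{r_j} + \beta^{j}_{r_j k})}{D_{jk}}, \qquad D_{jk} = \sum_{r=1}^{R_j} \exp(\beta^{j}_{r} + \beta^{j}_{r k}),$$

with $D_{jk}$ the normalizing constant of the response model for variable $Y_j$ in class $k$. Substituting these expressions into $P(X = k \mid \mathbf{Y} = \mathbf{r}) = P(X = k) \prod_{j=1}^{J} P(Y_j = r_j \mid X = k) / P(\mathbf{Y} = \mathbf{r})$ and cancelling the factors that do not depend on $k$ gives the scoring equations:

$$P(X = k \mid \mathbf{Y} = \mathbf{r}) = \frac{\exp\big(\gamma^{*}_{k} + \sum_{j=1}^{J} \beta^{j}_{r_j k}\big)}{\sum_{k'=1}^{K} \exp\big(\gamma^{*}_{k'} + \sum_{j=1}^{J} \beta^{j}_{r_j k'}\big)}, \qquad \gamma^{*}_{k} = \gamma_k - \sum_{j=1}^{J} \log(D_{jk}).$$

It should be noted that the $\gamma^{*}_{k}$ terms are identical to the one-variable parameters for the latent classes in the log-linear formulation of the LC model proposed by Haberman (1979). In this formulation, the joint distribution of $X$ and $\mathbf{Y}$, $P(X = k, \mathbf{Y} = \mathbf{r})$, is modelled as follows:

$$P(X = k, \mathbf{Y} = \mathbf{r}) = \exp\Big(\lambda + \gamma^{*}_{k} + \sum_{j=1}^{J} \beta^{j}_{r_j} + \sum_{j=1}^{J} \beta^{j}_{r_j k}\Big).$$

The posterior class membership probability is obtained as $P(X = k \mid \mathbf{Y} = \mathbf{r}) = P(X = k, \mathbf{Y} = \mathbf{r}) / \sum_{k'=1}^{K} P(X = k', \mathbf{Y} = \mathbf{r})$. As can be seen, since $\sum_{j=1}^{J} \beta^{j}_{r_j}$ and $\lambda$ cancel, $P(X = k \mid \mathbf{Y} = \mathbf{r})$ has exactly the form we derived above. Thus, when using Haberman's log-linear formulation, the constants of the scoring equations are also model parameters. However, an important disadvantage of this formulation is that it is computationally less efficient, since parameter estimation involves processing the cell entries in the joint cross-tabulation of $X$ and all $Y_j$ variables. Therefore, this log-linear approach can be used only when the number of response variables is small.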
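The equivalence between the posterior obtained with Bayes' rule and the linear logistic scoring equation can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-class model with three dichotomous items; all parameter values and the names `gamma`, `beta0`, `beta1`, `posterior_bayes`, and `posterior_scoring` are illustrative and not taken from the paper.

```python
import math

# Hypothetical 2-class model with 3 dichotomous items (categories 0 and 1).
# Dummy coding: parameters of the first class and first category are 0.
gamma = [0.0, -0.4]                                  # class intercepts
beta0 = [[0.0, 0.5], [0.0, -0.3], [0.0, 1.0]]        # beta0[j][r]: item intercepts
beta1 = [[[0.0, 0.0], [0.0, 1.2]],                   # beta1[j][r][k]: class-specific slopes
         [[0.0, 0.0], [0.0, -0.8]],
         [[0.0, 0.0], [0.0, 2.0]]]
K, J = 2, 3


def log_norm_const(j, k):
    """Log of the normalizing constant of item j in class k."""
    return math.log(sum(math.exp(beta0[j][r] + beta1[j][r][k])
                        for r in range(len(beta0[j]))))


def posterior_bayes(resp):
    """Posterior via Bayes' rule on class sizes and response probabilities."""
    size_den = sum(math.exp(g) for g in gamma)
    joint = []
    for k in range(K):
        p = math.exp(gamma[k]) / size_den
        for j, r in enumerate(resp):
            p *= math.exp(beta0[j][r] + beta1[j][r][k] - log_norm_const(j, k))
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


# Scoring-equation intercepts: class intercept minus the summed log constants.
gamma_star = [gamma[k] - sum(log_norm_const(j, k) for j in range(J))
              for k in range(K)]


def posterior_scoring(resp):
    """Posterior via the linear logistic scoring equation (slopes only)."""
    lin = [gamma_star[k] + sum(beta1[j][r][k] for j, r in enumerate(resp))
           for k in range(K)]
    top = max(lin)                                   # guard against overflow
    expo = [math.exp(v - top) for v in lin]
    return [e / sum(expo) for e in expo]


print(posterior_bayes((1, 0, 1)))
print(posterior_scoring((1, 0, 1)))
```

Both functions should return the same probabilities for every response pattern; the scoring version needs only the intercepts `gamma_star` and the slopes, not the item intercepts.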

Ordinal Responses and Counts
In the LatentGOLD program for LC analysis (Vermunt and Magidson, 2016, 2021), ordinal response variables can be modeled using an adjacent-category logit model, that is, using a canonical link function (Agresti, 2002). More specifically, these are multinomial logit models in which the class-indicator association parameters are restricted as follows: $\beta^{j}_{r_j k} = \beta^{j}_{k} \, r_j$; that is, to be nominal-by-linear (Goodman, 1979; Heinen, 1996). This implies that for ordinal variables,

$$P(Y_j = r_j \mid X = k) = \frac{\exp(\beta^{j}_{r_j} + \beta^{j}_{k} \, r_j)}{\sum_{r=1}^{R_j} \exp(\beta^{j}_{r} + \beta^{j}_{k} \, r)}.$$
The same restrictions imposed on the $\beta^{j}_{r_j k}$ parameters also apply to the scoring equations; that is, for ordinal variables, we replace the $\beta^{j}_{r_j k}$ terms in the scoring equations by $\beta^{j}_{k} \, r_j$. This shows that in the ordinal case, the class-membership logits are linear functions of the item responses. It should be noted that while we assumed that the category scores ranged from 1 to $R_j$, the adjacent-category logit model allows for any type of scoring. In its more general form, $\beta^{j}_{r_j k} = \beta^{j}_{k} \, v^{j}_{r_j}$, where $v^{j}_{r_j}$ is the score for category $r_j$.
When modeling ordinal variables using other (non-canonical) link functions, such as cumulative logit or cumulative probit link functions, exact expressions for the scoring equations no longer exist.As will be shown below, a possible way out is to estimate the scoring equations treating the response variables as either numeric or nominal predictors of class membership.
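Also for a Poisson and binomial count variable, the scoring equations contain the term $\beta^{j}_{k} \, r_j$. This can be seen from the fact that the class-specific density of a count variable $Y_j$ takes on the following form:

$$P(Y_j = r_j \mid X = k) = \frac{h_j(r_j) \exp(\beta^{j}_{0} \, r_j + \beta^{j}_{k} \, r_j)}{D_{jk}},$$

where $h_j(r_j)$ collects the factors that involve neither the class nor the model parameters, and where it should be noted that $h_j(r_j)$ and $\beta^{j}_{0} \, r_j$ cancel from the scoring equation. The expression for $\log D_{jk}$ changes compared to the nominal and ordinal case. For Poisson counts, $\log D_{jk} = \exp(\beta^{j}_{0} + \beta^{j}_{k}) \, e_j$, where $e_j$ is the exposure; and for binomial counts, $\log D_{jk} = n_j \log(1 + \exp(\beta^{j}_{0} + \beta^{j}_{k}))$, where $n_j$ represents the number of trials. This shows that scoring equations should also include terms for the exposure (number of trials) when this number varies across individuals. When the exposures (numbers of trials) are fixed, so that the $\log D_{jk}$ terms are constants as for nominal and ordinal variables, these terms can be included in the constants $\gamma^{*}_{k}$.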
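For a Poisson count with an individual-specific exposure, the role of the exposure term can be illustrated as follows. This is a minimal sketch, assuming a hypothetical two-class model with a single count indicator; the parameter values and the names `b0`, `bk`, `posterior_bayes`, and `posterior_scoring` are illustrative only.

```python
import math

# Hypothetical 2-class model with one Poisson count indicator.
gamma = [0.0, 0.7]        # class intercepts
b0 = -1.0                 # log rate in the reference class
bk = [0.0, 1.5]           # class-specific slopes (log rate ratios)


def softmax(lin):
    top = max(lin)
    expo = [math.exp(v - top) for v in lin]
    return [e / sum(expo) for e in expo]


def posterior_bayes(count, exposure):
    """Posterior via Bayes' rule with the Poisson probabilities."""
    log_joint = []
    for k in range(2):
        mean = exposure * math.exp(b0 + bk[k])
        poisson = math.exp(-mean) * mean ** count / math.factorial(count)
        log_joint.append(gamma[k] + math.log(poisson))
    return softmax(log_joint)


def posterior_scoring(count, exposure):
    """Scoring equation: slope times count, minus exposure times class rate."""
    lin = [gamma[k] + bk[k] * count - exposure * math.exp(b0 + bk[k])
           for k in range(2)]
    return softmax(lin)


print(posterior_bayes(4, 2.0))
print(posterior_scoring(4, 2.0))
```

The term `exposure * math.exp(b0 + bk[k])` plays the role of the log normalizing constant; it can be absorbed into the intercept only when the exposure is the same for every individual.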

Local Dependencies
Thus far, we assumed that responses are independent within classes. Now we will look at the scoring equations for LC models with local dependencies (Hagenaars, 1988; Magidson and Vermunt, 2004; Oberski, Kollenburg, and Vermunt, 2013).

Missing Values
When some indicators have missing values, the LC model for the observed values $\mathbf{r}_{obs}$ can be defined as follows:

$$P(\mathbf{Y}_{obs} = \mathbf{r}_{obs}) = \sum_{k=1}^{K} P(X = k) \prod_{j=1}^{J} P(Y_j = r_j \mid X = k)^{\delta_j},$$

where $\delta_j = 1$ if the response variable concerned is observed and 0 when it has a missing value (Vermunt et al., 2008; Vermunt and Magidson, 2016). Note that this formulation implies that the product is taken over the observed responses only. Therefore, similar to subjects with complete data, the computation of the posteriors for subjects with missing values involves using only their observed responses. This means that the sum $\sum_{j=1}^{J} \beta^{j}_{r_j k}$ should be taken over the observed variables only or, equivalently, that $\beta^{j}_{r_j k}$ should be set to 0 for the missing value category.
However, the sum $\sum_{j=1}^{J} \log(D_{jk})$ of the log normalizing constants, which is subtracted from the constants, should also be taken over the observed variables only, implying that each pattern of missing data has its own set of constants $\gamma^{*}_{k}$. A way to account for this is by using the same $\gamma^{*}_{k}$ for all observations but adding a term $\log(D_{jk})$ to the scoring equation when variable $j$ has a missing value. In other words, in order to deal with missing data, the scoring equation should be expanded to include the term $\sum_{j=1}^{J} \log(D_{jk}) (1 - \delta_j)$. Note that this approach can be used with any missing data pattern occurring among the new subjects for which one wishes to obtain the posteriors.
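The missing-data correction can be sketched in a few lines. The example below assumes a hypothetical two-class model with three dichotomous items coded 0/1, with `None` marking a missing response; the parameter values and function names are illustrative and not part of the paper.

```python
import math

# Hypothetical 2-class model, 3 dichotomous items coded 0/1.
# Class 0 and category 0 are the reference (dummy coding).
gamma = [0.0, -0.4]                          # class intercepts
b = [0.5, -0.3, 1.0]                         # item intercepts for category 1
s = [[0.0, 1.2], [0.0, -0.8], [0.0, 2.0]]    # s[j][k]: class-specific slopes
K, J = 2, 3


def log_d(j, k):
    """Log normalizing constant of item j in class k."""
    return math.log(1.0 + math.exp(b[j] + s[j][k]))


def softmax(lin):
    top = max(lin)
    expo = [math.exp(v - top) for v in lin]
    return [e / sum(expo) for e in expo]


# Complete-data intercepts, shared by all missing-data patterns.
gamma_star = [gamma[k] - sum(log_d(j, k) for j in range(J)) for k in range(K)]


def posterior_scoring(resp):
    """Scoring equation with log constants added back for missing items."""
    lin = []
    for k in range(K):
        value = gamma_star[k]
        for j, r in enumerate(resp):
            if r is None:
                value += log_d(j, k)         # missing: slope 0, add log constant
            else:
                value += s[j][k] * r         # observed: usual slope term
        lin.append(value)
    return softmax(lin)


def posterior_bayes(resp):
    """Bayes' rule using the observed responses only."""
    log_joint = []
    for k in range(K):
        value = gamma[k]
        for j, r in enumerate(resp):
            if r is not None:
                value += r * (b[j] + s[j][k]) - log_d(j, k)
        log_joint.append(value)
    return softmax(log_joint)


print(posterior_scoring((1, None, 0)))
print(posterior_bayes((1, None, 0)))
```

Because the correction is applied item by item, the same intercepts and slopes serve every missing-data pattern, including patterns that did not occur in the estimation sample.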
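A special type of missing data occurs when the LC model is estimated using $J$ variables, but only the first $J_1$ of these are to be used for classification purposes (where $J = J_1 + J_2$); for example, this situation may occur if one wishes to ignore the last $J_2$ variables when calculating the classification probabilities because this information will not be available when performing out-of-sample predictions. In this case, the posteriors are obtained as follows:

$$P(X = k \mid Y_1 = r_1, \ldots, Y_{J_1} = r_{J_1}) = \frac{\exp\big(\gamma_k - \sum_{j=1}^{J_1} \log(D_{jk}) + \sum_{j=1}^{J_1} \beta^{j}_{r_j k}\big)}{\sum_{k'=1}^{K} \exp\big(\gamma_{k'} - \sum_{j=1}^{J_1} \log(D_{jk'}) + \sum_{j=1}^{J_1} \beta^{j}_{r_j k'}\big)}.$$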
As can be seen, only the slope parameters and the normalizing constants of the first $J_1$ response variables will enter into the scoring equations.

For continuous responses with variances that are equal across classes, the terms of the normal density that do not depend on the class cancel from the scoring equations. This yields a set of scoring equations similar to those obtained in linear discriminant analysis (Hastie, Tibshirani, and Friedman, 2008).

An Example with Five Dichotomous Indicators
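In the more general case of multivariate normal responses with unrestricted covariances $\Sigma_k$, the LC model becomes

$$f(\mathbf{y}) = \sum_{k=1}^{K} P(X = k) \, f(\mathbf{y} \mid X = k),$$

with

$$f(\mathbf{y} \mid X = k) = (2\pi)^{-J/2} \, |\Sigma_k|^{-1/2} \exp\Big(-\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu}_k)' \, \Sigma_k^{-1} (\mathbf{y} - \boldsymbol{\mu}_k)\Big).$$

As can be seen, the scoring equations now contain not only linear and quadratic terms; interaction terms are also needed. More specifically, denoting an entry of ... (Banfield and Raftery, 1993), and mixture growth models (Muthén, 2004). For these models, the same scoring equations can be used as when means and covariances are unrestricted.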

Covariates
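When covariates are included in the model, the latent class probabilities are typically modeled as a logistic function of the covariates (Bandeen-Roche et al., 1997; Dayton and McReady, 1988; Yamaguchi, 2000). That is,

$$P(X = k \mid \mathbf{z}) = \frac{\exp\big(\gamma_{0k} + \sum_{p} \gamma_{pk} \, z_p\big)}{\sum_{k'=1}^{K} \exp\big(\gamma_{0k'} + \sum_{p} \gamma_{pk'} \, z_p\big)}.$$

Here, $\mathbf{z}$ denotes the vector of covariates, and $\gamma_{0k}$ and $\gamma_{pk}$ represent the constants and the regression parameters for covariate $z_p$.
Since the denominator does not depend on the class, it cancels from the formula for the posterior class membership probability and thus also from the scoring equations. Assuming the response variables are nominal, the posterior probability of class membership given responses and covariates becomes:

$$P(X = k \mid \mathbf{Y} = \mathbf{r}, \mathbf{z}) = \frac{\exp\big(\gamma^{*}_{0k} + \sum_{p} \gamma_{pk} \, z_p + \sum_{j=1}^{J} \beta^{j}_{r_j k}\big)}{\sum_{k'=1}^{K} \exp\big(\gamma^{*}_{0k'} + \sum_{p} \gamma_{pk'} \, z_p + \sum_{j=1}^{J} \beta^{j}_{r_j k'}\big)},$$

where $\gamma^{*}_{0k}$ equals $\gamma_{0k}$ minus the sum over the response variables of the log normalizing constants of $P(Y_j = r_j \mid X = k)$.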
That is, the covariate terms can simply be added to the scoring equations.Note that when the direct effect of a covariate is allowed to be class specific, or equivalently, when an interaction term is included, the indicator-covariate interaction should also be added to the scoring equation.Again, the scoring equations will be exact only when the covariate concerned is nominal or dichotomous.For example, this is the specification used in multiple-group LC models in which response probabilities may be allowed to differ across subgroups for one or more indicators (Clogg and Goodman, 1984;Eid, Langeheine, and Diener, 2004;Kankaras, Moors, and Vermunt, 2010).

Multiple Latent Variables
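Suppose the LC model contains two latent variables $X_1$ and $X_2$ instead of one, so that we have a LC Factor or Discrete Factor model. Such a model has the following form (Goodman, 1974b; Hagenaars, 1990; Magidson and Vermunt, 2001; Vermunt and Magidson, 2005):

$$P(\mathbf{Y} = \mathbf{r}) = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} P(X_1 = k_1, X_2 = k_2) \prod_{j=1}^{J} P(Y_j = r_j \mid X_1 = k_1, X_2 = k_2).$$

As in the single latent variable case, the $P(X_1 = k_1, X_2 = k_2)$ and $P(Y_j = r_j \mid X_1 = k_1, X_2 = k_2)$ can be modelled using logistic regression models (Magidson and Vermunt, 2001). For example,

$$P(Y_j = r_j \mid X_1 = k_1, X_2 = k_2) = \frac{\exp(\beta^{j}_{r_j} + \beta^{j1}_{r_j k_1} + \beta^{j2}_{r_j k_2})}{D_{j k_1 k_2}}, \qquad D_{j k_1 k_2} = \sum_{r=1}^{R_j} \exp(\beta^{j}_{r} + \beta^{j1}_{r k_1} + \beta^{j2}_{r k_2}).$$

Also in this case, the posterior probabilities can be written as functions of the LC model parameters; i.e.,

$$P(X_1 = k_1, X_2 = k_2 \mid \mathbf{Y} = \mathbf{r}) = \frac{\exp\big(\gamma^{*}_{k_1 k_2} + \sum_{j=1}^{J} \beta^{j1}_{r_j k_1} + \sum_{j=1}^{J} \beta^{j2}_{r_j k_2}\big)}{\sum_{k_1'=1}^{K_1} \sum_{k_2'=1}^{K_2} \exp\big(\gamma^{*}_{k_1' k_2'} + \sum_{j=1}^{J} \beta^{j1}_{r_j k_1'} + \sum_{j=1}^{J} \beta^{j2}_{r_j k_2'}\big)},$$

where the $\gamma^{*}_{k_1 k_2}$ contain the $\gamma$ terms and the normalizing constants $D_{j k_1 k_2}$.
Note that the above equation for two latent variables can easily be generalized to an arbitrary number of $L$ latent variables. The logistic scoring equation then becomes:

$$P(X_1 = k_1, \ldots, X_L = k_L \mid \mathbf{Y} = \mathbf{r}) = \frac{\exp\big(\gamma^{*}_{k_1 \ldots k_L} + \sum_{\ell=1}^{L} \sum_{j=1}^{J} \beta^{j\ell}_{r_j k_\ell}\big)}{\sum_{k_1'} \cdots \sum_{k_L'} \exp\big(\gamma^{*}_{k_1' \ldots k_L'} + \sum_{\ell=1}^{L} \sum_{j=1}^{J} \beta^{j\ell}_{r_j k_\ell'}\big)}.$$

It should be noted that the marginal posterior probabilities $P(X_\ell = k_\ell \mid \mathbf{Y} = \mathbf{r})$, which are obtained by collapsing over the other latent variables, cannot be written as logistic functions.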
However, logistic approximations of the marginal posteriors may be precise enough in most applications.Below, we discuss how such approximations can be obtained.

Estimating the Scoring Equations Using Logistic Regression Analysis
Rather than computing the scoring equations from the parameters of the LC model, it is also possible to obtain these equations posthoc using a standard routine for multinomial logistic regression analysis. This involves the following three steps:
1. After selection of the final LC model, save the posterior class membership probabilities to an output file. This is a feature available in all software packages for LC analysis.
Depending on the situation, in the third step the responses and covariates are modeled as either nominal or numeric predictors, quadratic and/or interaction effects are added, and missing value dummies are included. For count variables, one should include the exposures (or total number of trials) as additional numeric predictors when these differ across individuals. Steps 2 and 3 are automated in the LatentGOLD program (Vermunt and Magidson, 2016, 2021), and are called Step3-Scoring.
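The posthoc procedure (save the posteriors, build an expanded data set with one record per subject and class, and fit a posterior-weighted logistic regression) can be sketched as follows. This is an illustration under stated assumptions, not the LatentGOLD implementation: the model is a hypothetical two-class model with three dichotomous items, every response pattern enters once with its model-implied frequency instead of a real sample, and a plain gradient-ascent loop stands in for a standard regression routine. With two classes the multinomial model reduces to a binary logistic regression.

```python
import math
from itertools import product

# Hypothetical 2-class model, 3 dichotomous items coded 0/1;
# class 0 and category 0 are the reference (dummy coding).
gamma = [0.0, -0.4]                          # class intercepts
b = [0.5, -0.3, 1.0]                         # item intercepts
s = [[0.0, 1.2], [0.0, -0.8], [0.0, 2.0]]    # s[j][k]: class-specific slopes
K, J = 2, 3


def joint(resp, k):
    """P(X = k, Y = resp) under the hypothetical LC model."""
    p = math.exp(gamma[k]) / sum(math.exp(g) for g in gamma)
    for j, r in enumerate(resp):
        eta = b[j] + s[j][k]
        p *= math.exp(r * eta) / (1.0 + math.exp(eta))
    return p


# Steps 1 and 2: posteriors and expanded data set with K records per pattern.
# Each record holds the predictors (with a leading 1), the class, and a weight.
rows = []
for resp in product([0, 1], repeat=J):
    marginal = sum(joint(resp, k) for k in range(K))
    x = (1.0,) + tuple(float(r) for r in resp)
    for k in range(K):
        posterior = joint(resp, k) / marginal
        rows.append((x, k, marginal * posterior))   # frequency times posterior

# Step 3: weighted logistic regression of class on the items.
coef = [0.0] * (J + 1)                       # intercept, then one slope per item
for _ in range(20000):
    grad = [0.0] * (J + 1)
    for x, k, w in rows:
        lin = sum(c * xi for c, xi in zip(coef, x))
        p1 = 1.0 / (1.0 + math.exp(-lin))
        for i, xi in enumerate(x):
            grad[i] += w * (k - p1) * xi     # weighted score contribution
    coef = [c + g for c, g in zip(coef, grad)]

# Exact scoring equation for the logit of class 1 versus class 0.
exact_intercept = gamma[1] - gamma[0] - sum(
    math.log(1.0 + math.exp(b[j] + s[j][1]))
    - math.log(1.0 + math.exp(b[j] + s[j][0])) for j in range(J))
exact = [exact_intercept] + [s[j][1] - s[j][0] for j in range(J)]

print([round(c, 4) for c in coef])
print([round(e, 4) for e in exact])
```

Because the posteriors of this model are exactly logistic in the items, the fitted coefficients should agree with the exact scoring equation up to the convergence tolerance of the optimizer. In practice one would replace the loop by any routine that accepts case weights.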
This approach can be used not only for the posthoc computation of the exact scoring equations, but also for obtaining approximate scoring equations. This is useful when an exact form does not exist, such as when direct effects of numeric covariates on indicators were included in the LC model or when non-canonical link functions were used for ordinal variables, as well as in the situation where one prefers a set of simplified equations, say without quadratic or interaction terms, that are almost as good as the exact ones. An example of the latter can be seen in Table 3, which reports the approximate scoring equations for the diabetes.dat example presented above, but leaving out the quadratic terms of $y_1$ and $y_2$ and their interaction terms.
The approximate equations predict the class memberships almost as well as the exact equations; that is, the entropy $R^2$ equals 0.817, while its original value equals 0.833.
When ordinal variables are modeled using non-canonical link functions, such as cumulative logit or probit models, we have two options. Option 1 is to compute the exact scoring equations by treating the response variables as nominal predictors in the posthoc logistic regression analysis; that is, by making use of the fact that the estimated $P(Y_j = r_j \mid X = k)$ based on an ordinal model can be reproduced perfectly by an unrestricted multinomial model. Option 2 is to estimate the scoring equations using the response variables as numeric predictors, which in fact implies that the estimated $P(Y_j = r_j \mid X = k)$ from the original LC model are approximated by an adjacent-category logit model.
As shown above, in LC models with multiple latent variables, an exact set of logistic scoring equations exists for the joint class membership probabilities, but not for the marginal class membership probabilities.The posthoc estimation method can also be used to obtain approximate scoring equations based on the marginal posteriors.Applying these equations will be simpler than first computing the joint and subsequently collapsing over the other latent variables, especially with models containing more than two latent variables.The quality of the resulting approximation can be assessed by a goodness-of-fit measure.

Discussion
As in continuous latent variables models, in LC models it is important to have a simple scoring rule for predicting a person's value on the latent variable.In this paper, we showed that for LC models this scoring rule has the form of a linear logistic equation, with weights which are simple functions of the original LC model parameters.We derived the exact scoring equations for nominal, ordinal, count, and continuous response variables, for local independence and local dependence models, for models with covariates, for models with multiple latent variables, and for models with missing values on some of the indicators.Moreover, we discussed several situations in which exact scoring equations may not exist, such as LC models with direct effects of covariates on the indicators and LC models in which the conditional response distributions are restricted using regression models based on non-canonical link functions.
We also explained how to compute exact or approximate scoring equations with the saved posterior probabilities from any LC analysis program.This can be achieved with standard routines for logistic regression analysis.In practice, this may be much easier than computing the scoring equations from the LC model parameters, where the constants from the scoring equations may be somewhat more tedious to obtain.
While not discussed explicitly, the computation of the scoring equations proceeds in exactly the same manner in LC models for mixed responses; that is, in LC models for combinations of nominal, ordinal, count, and continuous indicators (Hennig and Liao, 2013;Hunt and Jorgensen, 1999;Vermunt and Magidson, 2002).The only thing that needs to be done in the computation of the scoring equations is to collect the terms for the different indicators, irrespective of their scale types.When using the posthoc method based on a logistic regression analysis, things are even easier.Nominal indicators are used as nominal predictors, and ordinal, count and continuous indicators as numeric predictors.Depending on the situation, quadratic and/or interaction terms may also need to be included.
The scoring equations discussed in this article can be used to obtain point estimates for the posterior probabilities, not only for subjects in the original sample, but also for new subjects.
However, an issue not dealt with in this paper is the uncertainty about those estimates. Since the "regression" weights of the scoring equation are sample estimates, when deriving a prediction, it would be better to take into account this sampling variability. Note that the weights are functions of the original model parameters, for which we have the estimated asymptotic variance-covariance matrix. A possible approach to obtain the covariance matrix of the weights involves sampling, say, 100 parameter sets from their estimated multivariate normal distribution and computing 100 sets of weights. Other options to explore are the delta method and bootstrapping (Dias and Vermunt, 2008). Our future research will focus on this important topic.
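The sampling idea mentioned above can be sketched as follows. This is a rough illustration only: the estimates and standard errors are hypothetical, and the parameters are drawn independently (a diagonal covariance matrix) to keep the example within the standard library, whereas the approach described in the text samples from the full estimated multivariate normal distribution.

```python
import math
import random

random.seed(1)

# Hypothetical estimates and standard errors for a 2-class model with two
# dichotomous items coded 0/1: (gamma_2, b_1, s_1, b_2, s_2).
# Assumption: independent draws, i.e. a diagonal covariance matrix.
estimate = [-0.4, 0.5, 1.2, -0.3, -0.8]
std_err = [0.15, 0.10, 0.25, 0.10, 0.20]


def posterior_class2(theta, resp):
    """Posterior of class 2 from the scoring equation implied by theta."""
    g2, b1, s1, b2, s2 = theta
    # Intercept: class intercept minus the differences in log constants.
    intercept = (g2
                 - (math.log(1 + math.exp(b1 + s1)) - math.log(1 + math.exp(b1)))
                 - (math.log(1 + math.exp(b2 + s2)) - math.log(1 + math.exp(b2))))
    logit = intercept + s1 * resp[0] + s2 * resp[1]
    return 1.0 / (1.0 + math.exp(-logit))


resp = (1, 0)
point = posterior_class2(estimate, resp)

# Draw 100 parameter sets and recompute the posterior for each set.
draws = sorted(
    posterior_class2([random.gauss(m, se) for m, se in zip(estimate, std_err)], resp)
    for _ in range(100))
lower, upper = draws[2], draws[97]           # rough central 95% range

print(point, lower, upper)
```

The spread of `draws` gives an impression of how much the predicted class membership probability for one response pattern varies with the sampling uncertainty in the parameters.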

2. Create an expanded data set with $K$ records per subject, which contains a column with the class number taking on values from 1 to $K$, a column with the posterior probability for the person and class concerned, and columns for the response variables and covariates used in the LC model, the latter columns containing the same values repeated in each of the $K$ records for each subject.
3. Estimate a logistic regression model in which the posteriors are used as weights. The class number is the dependent variable, and the responses and covariates are the predictors.
Here, $\gamma_k$ are intercept or constant terms in the regression model for $P(X = k)$, and $\beta^{j}_{r_j}$ and $\beta^{j}_{r_j k}$ are intercept and slope parameters in the regression model for $P(Y_j = r_j \mid X = k)$.
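In the most general case, including a local dependency between (nominal) response variables $Y_j$ and $Y_{j'}$ implies that

$$P(Y_j = r_j, Y_{j'} = r_{j'} \mid X = k) = \frac{\exp(\beta^{j}_{r_j} + \beta^{j'}_{r_{j'}} + \beta^{j}_{r_j k} + \beta^{j'}_{r_{j'} k} + \beta^{jj'}_{r_j r_{j'}} + \beta^{jj'}_{r_j r_{j'} k})}{\sum_{r=1}^{R_j} \sum_{r'=1}^{R_{j'}} \exp(\beta^{j}_{r} + \beta^{j'}_{r'} + \beta^{j}_{r k} + \beta^{j'}_{r' k} + \beta^{jj'}_{r r'} + \beta^{jj'}_{r r' k})}.$$

This is a model with an association between $Y_j$ and $Y_{j'}$, $\beta^{jj'}_{r_j r_{j'}}$, and an interaction with the latent classes, $\beta^{jj'}_{r_j r_{j'} k}$. In other words, it represents a model in which the strength of the local dependency is allowed to vary across classes. As can be seen, one difference with the local independence model is that the normalizing constants entering in the $\gamma^{*}_{k}$ coefficients of the scoring equations should be computed per set of locally dependent variables. The $\beta^{jj'}_{r_j r_{j'}}$ term cancels from the scoring equations because it does not depend on the classes. In contrast, the term $\beta^{jj'}_{r_j r_{j'} k}$ becomes part of the scoring equations, which in the case of class-specific local dependencies contain not only main effects but also interaction terms. Note that when local dependencies are not class-specific, that is, when $\beta^{jj'}_{r_j r_{j'} k} = 0$, the only remaining difference between local independence and local dependence models concerns the computation of the constants $\gamma^{*}_{k}$.

The scoring equations in local-dependence LC models for ordinal variables are very similar to those for nominal variables. When the ordinal variables are modelled using an adjacent-category logit specification, $\beta^{jj'}_{r_j r_{j'}} = \beta^{jj'} \, r_j \, r_{j'}$ and $\beta^{jj'}_{r_j r_{j'} k} = \beta^{jj'}_{k} \, r_j \, r_{j'}$. The scoring equations will contain the term $\beta^{jj'}_{k} \, r_j \, r_{j'}$ when the interaction parameters $\beta^{jj'}_{k}$ are not fixed to 0.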

Types of Latent Class and Mixture Models Continuous Responses
Table 1 provides an example illustrating the computation of the scoring equations for an application with five dichotomous response variables. It concerns the model with three latent classes estimated for the LatentGOLD "political.sav" demo data set. The upper part of Table 1 gives the estimates of the model parameters $\gamma_k$, $\beta^{j}_{r_j}$, and $\beta^{j}_{r_j k}$ using dummy coding with the parameters for the first class and the first item category fixed to 0. The lower part gives the values of $\gamma^{*}_{k}$ and $\log(D_{jk})$, where $D_{jk}$ is the normalizing constant of item $j$ in class $k$ and where for consistency we also use dummy coding for the $\log(D_{jk})$ values. To obtain the $\log(D_{1k})$ values for the first item, we first compute $\exp(\beta^{1}_{r_1} + \beta^{1}_{r_1 k})$, which for all three classes equals 1.0000 for $r_1 = 1$, and 2.6533, 0.4452, and 1.4312 for $r_1 = 2$. Next, we sum the obtained values across the two item categories and take the log, yielding 1.2956, 0.3682, and 0.8884 for the three classes. Because of the dummy coding, we subtract the value of the first class, yielding the reported $\log(D_{1k})$ values 0.0000, -0.9275, and -0.4073. The $\log(D_{jk})$ values of all items are subtracted from the $\gamma_k$ values to get the intercepts $\gamma^{*}_{k}$ of the scoring equations, and the slopes $\beta^{j}_{r_j k}$ can be used without any modification in the scoring equations. In the case of a missing value, the slope parameters for the item concerned equal $\log(D_{jk})$. Appendix A shows R code generated for this application by LatentGOLD, which can be used for classifying new observations.

Now, let us turn to LC or mixture models for continuous response variables (McLachlan and Peel, 2000), also referred to as latent profile models. In a local independence model with normal within-class distributions with possibly unequal variances, the response distributions have the following form:

$$f(y_j \mid X = k) = \exp\Big(-\tfrac{1}{2} \log(2\pi\sigma^{2}_{jk}) - \frac{\mu^{2}_{jk}}{2\sigma^{2}_{jk}} + \frac{\mu_{jk}}{\sigma^{2}_{jk}} \, y_j - \frac{1}{2\sigma^{2}_{jk}} \, y_j^{2}\Big).$$

When variances are assumed to be equal across classes, the first and last term of the above univariate normal distribution become $-\tfrac{1}{2} \log(2\pi\sigma^{2}_{j})$ and $-\frac{1}{2\sigma^{2}_{j}} \, y_j^{2}$, respectively, neither of which depends on the class.

Table 2 reports the model parameters and the scoring equations for a LC model with three continuous indicators (Glucose, Insulin, and SSPG) from the LatentGOLD "diabetes.dat" demo data set. It is a three-class model with a free residual covariance between the first two class indicators and with class-specific residual (co)variances. The scoring equation for this model contains not only linear terms, but also quadratic terms as well as the interaction terms between $y_1$ and $y_2$.
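For locally independent normal indicators with class-specific variances, the intercepts, linear weights, and quadratic weights of the scoring equation follow directly from the class-specific means and variances. The sketch below assumes a hypothetical three-class model with two continuous indicators and no residual covariance (so no interaction terms arise); the parameter values are illustrative and are not those of the diabetes.dat example.

```python
import math

# Hypothetical 3-class latent profile model with 2 continuous indicators.
gamma = [0.0, 0.3, -0.5]                     # class intercepts
mu = [[1.0, 3.0, 5.0], [0.0, 2.0, 1.0]]      # mu[j][k]: class-specific means
var = [[1.0, 2.0, 0.5], [1.5, 1.5, 3.0]]     # var[j][k]: class-specific variances
K, J = 3, 2


def softmax(lin):
    top = max(lin)
    expo = [math.exp(v - top) for v in lin]
    return [e / sum(expo) for e in expo]


def posterior_bayes(y):
    """Bayes' rule with the normal densities evaluated directly."""
    log_joint = []
    for k in range(K):
        value = gamma[k]
        for j in range(J):
            value += (-0.5 * math.log(2 * math.pi * var[j][k])
                      - (y[j] - mu[j][k]) ** 2 / (2 * var[j][k]))
        log_joint.append(value)
    return softmax(log_joint)


# Scoring-equation coefficients as functions of the model parameters.
intercept = [gamma[k] - sum(0.5 * math.log(var[j][k])
                            + mu[j][k] ** 2 / (2 * var[j][k]) for j in range(J))
             for k in range(K)]
linear = [[mu[j][k] / var[j][k] for k in range(K)] for j in range(J)]
quadratic = [[-1.0 / (2 * var[j][k]) for k in range(K)] for j in range(J)]


def posterior_scoring(y):
    """Linear logistic scoring equation with linear and quadratic terms."""
    lin = [intercept[k] + sum(linear[j][k] * y[j] + quadratic[j][k] * y[j] ** 2
                              for j in range(J))
           for k in range(K)]
    return softmax(lin)


print(posterior_bayes((2.5, 1.0)))
print(posterior_scoring((2.5, 1.0)))
```

If the variances were equal across classes, the entries of `quadratic` would be identical for all classes and could be dropped, leaving a purely linear scoring equation.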
Covariates may also have direct effects on the indicators. Let us assume we have a single covariate $z$ which has a direct effect on the categorical response variable $Y_j$; that is,

$$P(Y_j = r_j \mid X = k, z) = \frac{\exp(\beta^{j}_{r_j} + \beta^{j}_{r_j k} + \beta^{jz}_{r_j} \, z)}{\sum_{r=1}^{R_j} \exp(\beta^{j}_{r} + \beta^{j}_{r k} + \beta^{jz}_{r} \, z)}.$$

As can be seen, in this model, the normalizing constants depend on the covariate value, meaning that we no longer have a single $\log D_{jk}$ which can be subtracted from $\gamma_k$. Because the $\log D_{jk|z}$ are not linear functions of the covariate values either, they cannot be added to the linear term for the covariate concerned. In other words, the exact linear logistic representation of the posterior probabilities collapses in this situation, though, as discussed below and in Appendix B, it may still be used as an approximation. An exception is the situation in which the covariate is a nominal or dichotomous variable, in which case exact scoring equations can still be obtained by subtracting $\log D_{jk|z}$ from the $\gamma$ terms of the covariate concerned.

Table 1: Latent class model parameters and scoring equation parameters for the political.sav data example

Table 2: Latent class model parameters and scoring equation parameters for the diabetes.dat data example

Table 3: Approximate scoring equation parameters for the diabetes.dat data example