Approximate likelihood inference in generalized linear latent variable models based on the dimension-wise quadrature

We propose a new method to perform approximate likelihood inference in latent variable models. Our approach provides an approximation of the integrals involved in the likelihood function through a reduction of their dimension that makes the computation feasible in situations in which classical and adaptive quadrature based methods are not applicable. We derive new theoretical results on the accuracy of the obtained estimators. We show that the proposed approximation outperforms several existing methods in simulations, and it can be successfully applied in presence of multidimensional longitudinal data when standard techniques are not applicable or feasible. MSC 2010 subject classifications: 62H25, 62F12, 62P10, 62P25.


Introduction
Latent variable models are used in many research fields where it is of interest to study constructs that are unobservable but can be indirectly measured by indicators related to them. Examples can be found in sociology where questionnaires are utilized to measure attitudes and opinions, in educational testing where students ability is evaluated using exams, and in medicine where quality of life is assessed by function scores (e.g., been able to walk independently).
A general framework that includes a large variety of latent variable models is represented by the generalized linear latent variable models [18,28]. Their purpose is to describe the relationship between a set of p responses or items, and a set of q latent variables or/and random effects (with q < p). The latent variables are supposed to account for the association between the response variables, while random effects are usually introduced in the model to account for some unobserved heterogeneity.
The generalized linear latent variable models can be estimated either using the maximum likelihood method [18,8], or under the bayesian paradigm [5,6]. Here, we focus on the former. The maximization procedure requires the latent variables and/or random effects to be integrated out from the likelihood function, and in general analytical solutions do not exist. Numerical quadrature based methods represent a widespread solution to this problem and, among them, the adaptive Gauss Hermite quadrature is considered the gold standard [21,25]. However, it is computationally unfeasible with a large number of latent variables.
Alternatively, the Laplace approximation does not suffer from the curse of dimensionality since it does not require to solve any integral [4]. It has been applied to latent variable models by [8]. In the bayesian framework, [24] combined it with numerical integration to provide a fast and accurate method, the so called integrated nested Laplace approximation, to approximate the predictive density of the latent variables/random effects. The simplicity of the standard Laplace approximation has a cost related to the fact that the order of the approximation error is O(p −1 ). Moreover, it becomes less adequate as the degree of discreteness increases [10]. Several authors developed higher order versions of the method to improve the approximation error [30,27,23,22,7,2]. However, their implementation requires to compute many partial derivatives of the integrand.
In this paper, we propose a new approach to approximate the integrals involved in the likelihood of latent variable models, that we refer to as dimensionwise quadrature method. It consists of representing the integrand as a sum of not overlapping components that are the terms of the Taylor series expansion of the function. Truncating the expansion reduces the dimension of the integrals, and makes the computation feasible also when the number of latent variables in the model is large. Moreover, in common circumstances in which existing methods may not converge or have very high bias, the proposed technique still performs robustly and efficiently. The corresponding estimators are asymptotically as accurate as the adaptive Gauss Hermite estimators. Differently from higher order Laplace approximations, the proposed approach does not require any derivative computation, hence it is very simple to implement.
The paper is organized as follows. In Section 2, we illustrate the integration problem when a maximum likelihood estimation of the model is used, and we discuss the proposed dimension-wise quadrature method for multidimensional integration. The asymptotic properties of the corresponding estimators are also derived. The finite sample performance of the proposed estimators are analyzed in a simulation study presented in Section 3, together with an investigation of the approximation error. The application to empirical data is illustrated in Section 4, and Section 5 concludes the paper.

Setting
Maximum likelihood estimates in the generalized linear latent variable framework are typically obtained by using either the expectation maximization algorithm or the direct maximization through Newton-type algorithms. For a random sample of size n, the observed data log-likelihood function is where θ denotes the vector of model parameters, and f (y l ; θ) is the probability associated to the individual response pattern y l . Typically, the assumption of conditional or local independence is made, which states that the p observed variables y l are independent given the q latent variables b l . The conditional distribution g(y l | b l ) is referred to as the measurement part of the model and is taken from the exponential family. The prior density h(b l ) represents the structural part and can be any continuous density function that ensures unimodality of the integrand in (2.1). In general, the multidimensional integrals involved in (2.1) cannot be solved analytically. An approximation is needed, on which the bias and variance of resulting estimators depend. For the derivations illustrated here, we follow the notation by [27] based on summation convention, and, for simplicity, we omit the individual subscript l.

The proposed method for multidimensional integration
Let L(b) denote the logarithm of the integrand in (2.1). Our proposed approximation relies on the asymptotic expansion of L(b) around its mode b mo = arg max b∈R q L(b), that is where L i1,i2 (b mo ) are the second order partial derivatives of L evaluated at the mode, and (b − b mo ) i1,i2 refers to specific components of the vector b − b mo . ν(b) includes all the terms of order higher than two in the expansion. Grouping these terms with respect to the number of variables involved in the derivation, we obtain is a set of indices in which the index i 1 is repeated j 1 times, the index i 2 is repeated j 2 times, and so on, with w u=1 j u = k and are the kth order partial derivatives of L evaluated at the mode and taken with respect to w variables. The probability associated to the individual response pattern is where the expected value is taken with respect to a multivariate normal density function φ(b; b mo , Σ mo ) whose mean vector is given by the mode b mo and its covariance matrix is minus the inverse of the Hessian matrix of L(b) evaluated at its mode, that is Σ mo = −L i1,i2 (b mo ).
being the third sum over all partitions P = p 1 | . . . | p t of the k indices {1 j1 i 1 . . . 1 jw i w } into t blocks, each of size greater than or equal to three.
Each term t w , 1 ≤ w ≤ q, is the summation of derivatives involving w variables. For example, involves all the partitions in t blocks, each of size greater than three, of the set in which the index i 1 is repeated k times. That is, it encompasses the derivatives of any order taken with respect to just one variable. A different representation of the terms t w , 1 ≤ w ≤ q, is provided in the following Proposition. Proposition 2.1. Let the function exp{ν r (b)} = exp{ν(b k1,...,kr )}, 1 ≤ k 1 < . . . < k r ≤ q, define a summation of terms that contain at most r variables, being the other q − r variables fixed to the corresponding modes. It follows that Proposition 2.1 is proved in the Appendix A. Hence, the function exp{ν(b)} admits the exact representation known as cut high dimensional model representation [12,31].
being s << q or, equivalently, We refer to (2.4) as dimension-wise approximation. The corresponding approximated marginal density function results with C mo obtained from the Cholesky decomposition of the matrix Σ mo . w * tj = w tj π −1/2 , where b tj and w tj , j = 1, ..., s − r, are the classical Gauss Hermite nodes and weights, respectively [29]. (2.5) is obtained by applying what we call dimension-wise quadrature.
When s = 0, the quantity among squared brackets in (2.5) reduces to one, such that f a (y; θ) reduces to the Laplace approximation of the integral. In other words, the dimension-wise quadrature provides an approximation of f (y; θ) that is more accurate than the classical Laplace approximation, due to the inclusion of higher (than two) order terms in the Taylor series expansion of L(b). On the other hand, when no integral dimension reduction is performed, Step 1 Choose s and nq Step 2 For each l (1 ≤ l ≤ n) • Compute the mode b mo,l of the integrand in (2.1), and its Hessian evaluated • Obtain the approximate marginal density function fa(y l ; θ) of expression (2.5) Step 3 Get the approximation of the log-likelihood function la(θ) as the sum of log(fa(y l ; θ)) over l tj ] are the adaptive Gauss Hermite nodes and weights, respectively. The approximation of the log-likelihood function (2.1) is derived by (2.5) following the main steps detailed in Table 1, where the implementation of the proposed dimension-wise quadrature is described. In this study, its maximization is performed using a quasi-Newton algorithm in which the gradient and the Hessian matrix are obtained using numerical derivatives.

Statistical properties of the estimators
To investigate the properties of the dimension-wise quadrature based estimators, denoted by θ ML , the asymptotic behavior of the approximation (2.5) is analyzed, and a detailed derivation of the corresponding error rate is given in Appendix B. We establish the asymptotic properties of θ ML under the following regularity conditions. Condition 2.1. (θ) has a unique maximum at θ 0 ∈ Θ.

Condition 2.2.
The parameter space Θ is compact. Under concavity of the objective function, the approximated log-likehood a (θ), compactness of Θ can be replaced by the assumption that the true parameter vector θ 0 lies in the interior of the parameter space, and the estimator θ ML lies in the interior of a neighborhood containing θ 0 [19].
A formal proof of these conditions in presence of ordinal and binary data is given by [9] and [1], respectively. (2.6) Thus, θ ML is consistent as long as both n and p grow to ∞. A formal proof of Proposition 2.2 is given in Appendix C. The n −1/2 term comes from the standard asymptotic theory, whereas the p −[nq/3+1] term derives from the dimension-wise quadrature approximation. For n q ≥ 3, these estimators are more accurate than the classical Laplace one, that are of order O(p −1 ). In particular, the proposed estimators share the same accuracy of the adaptive Gauss Hermite one, but the dimension-wise quadrature method avoids the main computational limitations of the latter.
As discussed by [9] and [1] for the Laplace and adaptive Gauss Hermite based estimators, respectively, when numerical techniques are used, the corresponding estimators belong to the class of M -estimators. Hence,

Comparison with the adaptive Gauss Hermite quadrature
The dimension-wise quadrature method is more advantageous than classical and adaptive quadrature techniques from a computational point of view. When evaluating the individual marginal density f (y; θ) using the adaptive rule with n q quadrature points selected for each dimension, the total number of function or response evaluations is n q q . In contrast, function evaluations are required using the dimension-wise quadrature. Figure  1 shows how the ratio of these two function evaluation numbers varies with respect to s, the number of terms involved in the dimension-wise quadrature, in presence of eight latent variables and for n q equal to 5, 7 and 11. A reduction of the computational effort is achieved when the ratio The amount of reduction depends on both s and n q . For the univariate dimension-wise quadrature, the ratio has a magnitude of order 10 −4 , 10 −5 , and 10 −7 when n q = 5, 7 and 11, respectively. In the bivariate case (s = 2), the magnitude of the ratio is of order 10 −3 , 10 −4 , and 10 −11 when n q = 5, 7 and 11, respectively. Furthermore, even if not shown here, it is evident that the reduction is dramatically enhanced as the number q of latent variables in the model increases.
It is worth noting that the dimension-wise quadrature method is feasible in cases in which the classical quadrature techniques are not. At the moment, the adaptive rule is doable in presence of up to ten latent variables in the The choice of s is fundamental in the dimension-wise quadrature technique. Standard criteria like AIC and BIC are not applicable since they would lead to choose the highest value of s. Indeed, as s increases, the value of the loglikelihood typically increases due to the additional information included in it (for further details see [17]). This behaviour of the log-likelihood is shown in the simulation study described in the next Section. For this reason, we follow the same 'rule of thumb' applied by [17] for the choice of the number of quadrature points in the traditional quadrature techniques, that is we increase s until estimates stabilize. This rule is applied in the real data example in Section 4. We underline that, once the value s is selected, the accuracy of the estimates obtained by applying the dimension-wise quadrature can be improved by increasing the number of quadrature points selected for each dimension.

Monte Carlo simulation study
The performance of the proposed method is evaluated in finite samples by means of a Monte Carlo simulation study in presence of binary data. In this specific case, where η represents the systematic component of the model defined as η = α 0 + α T b. α 0 is a p-dimensional vector of item specific intercepts, and α is a structure matrix that contains unknown factor loadings. b = (b 1 , . . . , b q ) T is a vector of q factors that account for the associations among the observed variables, assumed to be normally distributed with zero mean vector and correlation matrix Ψ. The relationship between the systematic and the random component is expressed through a logit link function. The technical details for the application of the dimension-wise quadrature in this particular case are reported in Appendix D.
We consider a four-factor model with eight indicators. Moreover, we use five quadrature points per dimension to compare the performance of the proposed method with the Laplace approximation and the adaptive Gauss Hermite quadrature method, that occurs when the approximated integral is fourdimensional. The true values of the model parameters have been selected so that at least q 2 constraints are imposed to guarantee invariance to factor rotations [11]. Four constraints are obtained from the correlation matrix Ψ, being the main diagonal elements equal to one. This also ensures the identification of the scale of the latent variables. Further constraints are obtained by assuming a simple structure for α, that is by partitioning the manifest variables into four non-overlapping groups, each indicative of a different latent variable.
The true values of the free loadings are fixed to 1.5 and the correlation parameters to 0.6. This results in low correlations between the variables underlying the binary items. In this case, it is known that, in general, the Laplace approximation performs well.
We set n = 100, n = 500, n = 1000 and used 500 replicates for each sample size. Figure 2 reports the box plots of the Monte Carlo estimators for α 11 and for ψ 12 based on Laplace (s = 0), the dimension-wise quadrature method (s = 1, 2, 3), and the adaptive Gauss Hermite quadrature for the different sample sizes. The results for all the other parameters are very similar, and they are not reported here for simplicity.
The Laplace estimators are highly biased for both the loading and the correlation parameter. The bias does not decrease as n increases since the error rate of the approximation, that is O(p −1 ), always predominates as n → ∞. On the contrary, both the dimension-wise quadrature with s = 1, 2, 3 and the adaptive Gauss Hermite approximation produce estimators with small bias and the bias tends to zero as n increases. The Laplace approximation strongly underestimates the variability of the loading estimators compared with the gold standard adaptive Gauss Hermite quadrature. On the other hand, the variability of both the loading and the correlation estimators based on the dimension-wise quadrature is increasingly similar to the corresponding adaptive based estimators as s increases.
The computational performance of the algorithm based on the different approximation methods is analyzed in Table 2. The average log-likelihood (Avloglik ) increases for increasing values of s. As for the computational time (Avtime), the algorithm is slower when s = 3 and with the adaptive quadrature. For  example, when n = 500 and n = 1000, in these cases on average it requires about three times to converge compared to s = 2. Although, as expected, the Laplace approximation appears to be the fastest method, in certain circumstances it may not converge or has very high bias, as shown above. On the other hand, the proposed method still performs robustly and efficiently. In terms of number of function evaluations (Nrfev ) the methods perform quite similarly apart from the case of n = 100 where there are noticeable differences. The behaviour of the approximation methods in presence of high correlations between the underlying variables have been also investigated. The results are similar to those shown here, provided good parameter starting values [17].

Investigation of the approximation error
We analyzed the theoretical error induced by each approximation technique in two simple cases: a three and four-factor model. Table 3 illustrates the error produced by the dimension-wise quadrature and the Laplace method in approximating the marginal density f (y, θ) overall possible response patterns. For the former, we used five quadrature points for dimension coherently with the previous simulation study. The benchmark is represented by the adaptive Gauss Hermite quadrature with 15 quadrature points per dimension that provides a very good approximation of the function up to the ninth digit, as shown in the third column of Table 3. The differences are evaluated in terms of Maximum Absolute Error (Max Abs Error) and Average Absolute Error (Avg Abs Error) observed over all possible response patterns computed for each rule with respect to the benchmark approximation. As expected, the biggest differences are observed for the Laplace approximation. In both models, it provides the worst approximation, with a percentage of uncovered area around 5%. On the other hand, the dimension-wise method performs well, and better as s increases. Specifically, with s = 2 for the threefactor model and s = 3 in the four-factor model, the area under the curve is almost completely covered. The approximation provided by the dimension-wise quadrature method with s = 3 is, on average, equal to that obtained with the adaptive quadrature up to the sixth digit, and provides very similar results to the adaptive quadrature with the same number of quadrature points (n q = 5).

Data analysis
We analyze data from the Aging, Demographics and Memory Study, that is the first population based study of dementia in the United States (the data are available on the website http://www.rand.org). The participants, aged 70 or older individuals, received a thorough in-home clinical and neuropsychological assessment that allowed researchers to estimate the prevalence, predictors, and outcomes of dementia in the United States elderly population. They were asked to answer to the items of a questionnaire devoted to the cognitive assessment. A detailed description of the measures of the cognitive functioning can be found in [20]. For our analysis, we selected a subset of these measures corresponding to binary items concerning the following subtractions: (1)  Previous exploratory and confirmatory analysis on the Aging, Demographics and Memory Study data [14,13] highlighted that these items are measures of a latent factor called serial 7's test correspondent to a specific sub-scale related to the mental status of the interviewed people. Three waves of data (2001-2003, 2002-2005, 2006-2008) were collected on a sub-sample of n = 71 individuals. The aim is to investigate how the working memory and mental processing task in which respondents counted backward from 100 by 7s changes over time.
The analysis of multidimensional longitudinal data can be challenging since a large number of random effects/latent variables can be required to take into account for different sources of variability in the data.
We evaluate the performance of the proposed approach by applying the generalized linear latent variable models for multidimensional longitudinal data proposed by [6] and [3] to p binary variables observed at T different time points, such that the response vector y = (y 11 , . . . , y pT ) T is pT -dimensional. Beyond the association between several items at the same time point, their dependence over time as well as the variability of the same item at different time points have to be considered. The latter is accounted by p itemspecific random effects u = (u 1 , . . . , u p ) T , whereas the former is explained by means of T time-dependent latent factors z = (z 1 , . . . , z T ) T , such that the vector of latent variables b = (z, u) T is (p + T )-dimensional. The correlation of the latent variables z over time is modeled through a first order nonstationary autoregressive process of parameter φ and V ar( The elements of Γ are the variances and covariances of the latent variables z t , t = 1, . . . , T , that is is the indicator function and t < t . The measurement part of the model is specified as in (3.1). The linear predictor is characterized by a pTdimensional vector α 0 of item-and time-specific intercepts, and a matrix α of dimension (pT × (p + T )) whose generic sub-matrix of the j-th item is given by [α j I T O[j, 1 T ]]. I T is the identity matrix, and O[j, 1 T ] is a null matrix of dimension T × (T + 1) whose j-th column is substituted by a T -dimensional vector of ones.
In the longitudinal setting, the number of latent variables increases linearly with the number of observations. In this particular example, the number of latent variables and random effects is equal to eight (q = 8), being p = 5 and T = 3. Although for these data the adaptive Gauss Hermite approximation is feasible if five quadrature points are selected per dimension, its computational burden is too heavy. More than two hours are required for just one function evaluation. Thus, we only apply the dimension-wise quadrature method, whose computational advantages have been discussed before. We consider a fully constrained model (equal thresholds and equal loadings over time) in order to obtain the same metric of the latent variable at the three time points. The resulting measurement invariance allows comparisons of the factor across time [15,16].
As for the choice of s, we start fitting the model with s equal to 0 (Laplace approximation), and we increase its value until the mean of the absolute relative differences in parameter estimates ( ) is sufficiently small (order of 10 −3 ). In Table 4, we report the estimated model parameters under the proposed method using five quadrature points for s = 0, 1, 2, 3, 4. Parameter estimates sensibly change from s = 1 to s = 2 (the average 12 = 0.329) whereas they are all very close when s = 3 and s = 4 (the average 34 = 0.007). This indicates that, for this particular example, the additional information included in the likelihood when s is greater than 3 is almost irrelevant. From a computational point of view, there is a substantial difference in the time to convergence as the value of s increases, that is as a greater number of terms are involved in the approximation. The computational time for s equal to four is more than five times the computational time for s equal to three. It is evident that the latter can be considered as the reference solution. Table 5 shows the parameter estimates for this case with associated asymptotic standard errors in brackets. The significance of the variances of the random effects can be tested by using the likelihood ratio statistics (LR) for consecutive nested models that are asymptotically mixture of two weighted chi-squared distributions [26]. In Table 6, we report the values of these statistics with associated p-values. All the variances are not significantly different from zero, indicating that there is no relevant individual heterogeneity on each item over time. On the other hand, the autoregressive parameter is significant (φ = 0.99 with standard error equal to 0.15), and indicates a highly persistent latent process over time.

Discussion
One of the main problems in the estimation of the generalized linear latent variable models is that the integrals involved in the likelihood do not always have an analytical solution. A gold standard approach is represented by the adaptive Gauss Hermite quadrature, that is known to provide quite accurate estimates, but becomes computational unfeasible in presence of a large number of factors. A typical case is represented by generalized linear latent variable models for longitudinal data where the number of latent variables/random effects increases proportionally with the number of items.
In this paper, we have proposed a new method that overcomes these limitations since it decomposes the q-dimensional integral into the sum of 1, . . . , sdimensional integrals, being s << q, that can be easily approximated using classical Gauss Hermite quadrature techniques. We have shown that the dimensionwise quadrature uses less information than the adaptive based one but the estimators obtained with the two methods share the same asymptotic properties. Moreover, in finite samples, the dimension-wise quadrature based estimators present similar accuracy to the adaptive based estimators, whereas Laplace estimators tend to be less accurate in general.
We implemented the proposed method for generalized linear latent variable models for mixed data in R combined with Fortran, and the code is available from the authors upon request. However, the extension of the dimension-wise method to other models is straightforward.

Appendix A: Proof of Proposition 2.1
Let exp{ν r (b)} define the summation of terms that contains at most r variables, that is being the other q − r variables fixed to the corresponding modes. Using Taylor expansion at b mo , where the third sum is over all partitions P = p 1 | . . . | p t of k indices {1 j1 k i1 , . . . , 1 jw k iw }, 1 ≤ i 1 < . . . < i w ≤ r, w u=1 j u = k, into t blocks of size greater than or equal to three, being {1 j1 k i1 , . . . , 1 jw k iw } a set of indices in which i 1 is repeated j 1 times, i 2 is repeated j 2 times, and so on.

Appendix B: Asymptotic behavior of the dimension-wise quadrature approximation
Motivated by the simulation study and real data application, we derive the properties of the dimension-wise quadrature based estimators in generalized linear latent variable models for binary data. We consider a simple structure for the loading matrix, that is the manifest variables are partitioned into q nonoverlapping groups of size p i , i = 1, . . . , q, each indicative of a different latent variable. The logarithm of the integrand in (1) is where q i=1 p i = p, α 0j , j = 1, . . . , p i , i = 1, . . . , q, are item specific intercepts, and α ji , j = 1, . . . , p i , i = 1, . . . , q, are the factor loadings. The p i 's have the same order, that is min(p i ) = O(p). Ψ is a correlation matrix for identification reasons. To analyze the statistical properties of the approximated integral solution (1), we consider the expansion of exp{ν(b)} given in Corollary 2.1, that is where the third sum is over all partitions P of the k indices {1 j1 i 1 . . . 1 jw i w }, in which the index i 1 is repeated j 1 times, the index i 2 is repeated j 2 times, and so on, such that w u=1 j u = k and i 1 < i 2 < . . . < i w . In particular, P = p 1 | . . . | p t is a partition of the k indices {1 j1 i 1 . . . 1 jw i w } into t blocks, each of size greater than or equal to three. Based on (B.1), the exact solution of the integral is given by where f L = (2π) q/2 | Σ mo | 1/2 exp{L(b mo )}. Since only even order moments of multivariate normal variables do not vanish, both partitions P and Q refer to 2k indices {1 j1 i 1 . . . 1 jw i w }, such that w u=1 j u = 2k. Specifically, Q = q 1 | . . . | q k denotes the partition of these 2k indices into k blocks, each of size two. Each component L qm (b mo ), m = 1, . . . , k, refers to specific elements of the matrix Σ mo . Then, the approximation of f (y; θ) based on the dimension-wise quadrature can be rewritten as