Hybrid unadjusted Langevin methods for high-dimensional latent variable models

The exact estimation of latent variable models with big data is known to be challenging. The latents have to be integrated out numerically, and the dimension of the latent variables increases with the sample size. This paper develops a novel approximate Bayesian method based on the Langevin diffusion process. The method employs the Fisher identity to integrate out the latent variables, which makes it accurate and computationally feasible when applied to big data. In contrast to other approximate estimation methods, it does not require the choice of a parametric distribution for the unknowns, which often leads to inaccuracies. In an empirical discrete choice example with a million observations, the proposed method accurately estimates the posterior choice probabilities using only 2% of the computation time of exact MCMC.


Introduction
Latent variable models are widely used in econometrics due to their ability to capture complex stylised facts of economic data.Recent examples include mixed effects models (Danaher, 2023), discrete choice models (Loaiza-Maya and Nibbering, 2023), and state space models (Koopman et al., 2017).At the same time, many modern econometric estimation problems involve a large data set, which often results in a large number of latent variables in these models.Estimation of these high-dimensional latent variable models is challenging.
Large numbers of latent variables complicate both maximum likelihood and Bayesian estimation.Maximum likelihood estimation is often infeasible because it is not possible to analytically integrate the latent variables from the likelihood.Bayesian estimation deals with the latent variables by targeting the augmented posterior: the joint posterior distribution of the model parameters and the latent variables.To this end, exact Bayesian estimation methods, like Markov chain Monte Carlo (MCMC) methods, require generation of the latent variables.This typically induces high autocorrelation in the Markov chain, which makes exact Bayesian methods also computational infeasible for many big data problems.
A promising approximate Bayesian method from the machine learning (Vollmer et al., 2016;Hodgkinson et al., 2021) and the statistics literature (Dalalyan, 2017;Nemeth and Fearnhead, 2021) is the unadjusted Langevin algorithm (ULA).To generate samples from an approximation to the posterior, it utilizes the Euler-Maruyama discretization to approximate a continuous-time Langevin process based on the gradient of the logarithm of the posterior density.This discretization requires a choice of step size, which can substantially improve the rate of convergence relative to exact MCMC methods, at the cost of an approximation bias that increases in the dimension of the posterior (Hodgkinson et al., 2021).When applied to latent variable models, ULA would target the augmented posterior.For big data problems the dimension of the augmented posterior is large, which makes the approximation error substantial and the method highly inaccurate (Durmus and Moulines, 2017).Metropolis-adjusted Langevin algorithms (MALA) have been proposed to correct for this bias (Roberts and Tweedie, 1996), but these suffer from large computational costs and poor estimation results in high dimensions (De Bortoli et al., 2021).
Our main contribution is the development of the hybrid unadjusted Langevin algorithm (HULA), which is a computationally feasible and accurate estimation method for challenging high-dimensional latent variable models.HULA directly targets the marginal posterior of the model parameters rather than the posterior that is augmented with latent variables.Our method requires evaluation of an unbiased gradient estimate for the logarithm of the marginal posterior density, which we construct via the Fisher identity (Poyiadjis et al., 2011).This unbiased estimate requires the gradient of the logarithm of the augmented posterior density to be available in closed form, and efficient generation from the exact conditional posterior distribution of the latent variables.HULA has four main advantages.First, for an appropriate choice of step size, the rate of convergence can be substantially improved while the approximation error is small.By targeting directly the marginal posterior density, the HULA approximation error does not increase with the sample size but only with the dimension of the parameter space.Second, the method samples all model parameters at once rather than in blocks, and only requires the evaluation of a gradient with respect to the model parameters instead of both the model parameters and the latents as in ULA.Third, HULA is applicable to a wide range of econometric models, for instance, state space models, limited dependent variable models, and random coefficient models.Fourth, for certain models, the computational cost of HULA can be further reduced via the use of subsampling methods.Subsampling methods have also been shown to substantially improve the scalability of ULA in models without latent variables (Nemeth and Fearnhead, 2021).
We illustrate the potential of HULA in a big data discrete choice setting.We fit a multinomial probit model to more than a million observations on purchases from ten pasta brands.Exact MCMC methods for this model are known to suffer from high autocorrelation, due to the large number of latent utilities in the model.Compared to exact MCMC, HULA exhibits substantially faster convergence and higher effective sample sizes.At the same time, we observe that the estimates for the posterior choice probabilities of HULA and MCMC are indistinguishable, and that the loss in predictive accuracy is negligible.Since the multinomial probit model allows for subsampling, the computational costs can be further reduced by taking a random subsample of 20% of the observations to estimate the gradient.The accuracy of the posterior choice probabilities and posterior predictive show almost no decline, using only 2% of the computation time of exact MCMC.
The limitations of exact MCMC methods for the estimation of latent variable models has led to the development of other approximate Bayesian approaches.For instance, the econometrics literature has adopted variational Bayes (VB) methods in the estimation of state space models (Quiroz et al., 2022), choice models (Loaiza-Maya and Nibbering, 2023), and tobit models (Loaiza-Maya et al., 2022).Instead of sampling from the exact posterior, VB calibrates an approximation to the posterior using fast optimization techniques.This approach requires the researcher to specify a class of approximating densities, which can be challenging for complex models with many latent variables.Moreover, VB is known to accurately estimate posterior means but underestimates posterior variance (Blei et al., 2017).This may bias other objects of interests that are nonlinear transformations of the parameters, such as the choice probabilities from a multinomial probit model.This is in fact the case in our empirical application, where we demonstrate that HULA produces more accurate posterior choice probabilities than VB for some choice alternatives, while the estimation time of HULA with subsampling is of the same order.
The outline of this paper is as follows.Section 2 introduces Bayesian estimation via Langevin dynamics.Section 3 introduces the hybrid unadjusted Langevin algorithm for the estimation of econometric models with a large number of latent variables.Section 4 illustrates the method in a choice model with many latent variables.Finally, Section 5 concludes.
2 Bayesian estimation via Langevin dynamics

Setting
Consider a data vector y = (y ⊤ 1 , . . ., y ⊤ n ) ⊤ with n observations and a corresponding set of latent variables z = (z ⊤ 1 , . . ., z ⊤ n ) ⊤ , where z is an m z -dimensional vector.This paper considers parametric models that admit an augmented likelihood function of the form p(y, z|θ) = p(z|θ) where θ denotes an m θ -dimensional parameter vector.This includes, for instance, state space models, limited dependent variable models, and random coefficients models.
Bayesian estimation is concerned with the computation of the posterior distribution p(θ|y) ∝ p(θ) p(y, z|θ)dz, for a given set of prior beliefs on θ represented by the prior density p(θ).Generally, this posterior has an intractable analytical form as evaluation of p(y, z|θ)dz is often a complex integration problem.Instead, computation of (2) typically involves the use of MCMC methods that generate samples from the augmented posterior distribution which automatically generates samples from the desired posterior p(θ|y).This approach requires generation from the conditional posterior distribution p(z|y, θ), which typically induces high autocorrelation in the Markov Chain.Hence, exact MCMC methods may require a large number of draws, which makes it computationally impractical for models with large sets of latent variables.

The unadjusted Langevin algorithm
An unadjusted Langevin algorithm for (2) would be based on the m-dimensional Langevin stochastic differential equation where B t is an m-dimensional Brownian motion process, with m = m z +m θ .Given certain regularity conditions, the continuous-time dynamic process in (4) has been shown to have p(θ, z|y) as its invariant distribution (Roberts and Tweedie, 1996;Durmus and Moulines, 2017).However, sampling from ( 4) is not feasible.
The Langevin diffusion process can be approximated using the Euler-Maruyama discretization, which leads to the recursive formula where ǫ k ∼ N(0 m , I m ) and τ is a scalar determining the step size of the discretization.The process of iteratively sampling from ( 5) is referred to as ULA (Durmus and Moulines, 2017).Since the discretization may induce a bias, this algorithm generates draws from an approximation to p(θ, z|y).The step size τ controls the trade-off between the rate of convergence of the sampling algorithm and its approximation error.
Provided that the gradients in (5) can be efficiently computed, ULA is faster than traditional MCMC algorithms.In each sampling iteration, all elements of θ and z are sampled at once rather than in blocks.Moreover, at the cost of inducing more bias in the approximate posterior, ULA can also lead to faster convergence (Hodgkinson et al., 2021), and thus requiring less sample iterations.
However, both the approximation error and the computational costs of ULA can be large.First, the total variation norm between the exact posterior and the ULA approximation increases in m (Dalalyan, 2017;Durmus and Moulines, 2017).Although the approximation error can be corrected with a Metropolis-Hastings step (Roberts and Tweedie, 1996), this step is time consuming, deteriorates convergence properties, and may lead to poor estimation results (De Bortoli et al., 2021).Second, evaluation of the gradient ▽ z log p(θ, z|y) can be extremely costly for problems in which the number of latent variables is large; for instance, problems with millions of observations.

Hybrid Unadjusted Langevin
In this paper we propose a novel ULA method that directly approximates the posterior p(θ|y).Our method is based on the Langevin diffusion process where B t is an m θ −dimensional Brownian motion process.The discrete approximation of the ULA corresponding to (6) can be written as Since the approximation bias of the ULA increases in its dimension, the implied stationary posterior of ( 7) is subject to less bias than that of (5).Moreover, (7) includes a lowdimensional gradient compared to (5).To implement the ULA in ( 7), one must compute the gradient which requires evaluation of the intractable integral p(y, z|θ)dz.We assume that the gradient of the prior ▽ θ log p(θ) is available in closed form.We make ULA applicable to econometric models with latent variables by constructing an unbiased estimate of the gradient ▽ θ log p(θ|y), without evaluating the intractable integral in (8).To this end, we employ the Fisher identity, as used in for example Poyiadjis et al. (2011), which shows that the gradient of the marginal likelihood can be expressed as While the Fisher identity also involves an intractable integral, it allows us to construct an unbiased estimate of it as where z (s) ∼ p(z|y, θ) is a draw from the exact conditional posterior distribution of the latent variables.This draw can be produced either using exact Monte Carlo simulation or using an exact MCMC sampling scheme.For most econometric models, these draws can be readily produced.For instance, the states in state space models can be sampled via a Metropolis-Hastings step.The latent utilities in the multinomial probit model, which is a limited dependent variable model, can be generated one at a time via a Gibbs sampler.Regression models with random coefficients allow for exact sampling of the random coefficients.
The proposed sampling method makes ULA for p(θ|y) feasible by incorporating exact sampling from p(z|θ, y) to evaluate an unbiased estimate of the gradient G θ .We refer to our method as the Hybrid Unadjusted Langevin Algorithm (HULA).The step size τ in HULA trades-off speed of convergence for accuracy to p(θ|y).In contrast to ULA applied to p(θ, z|y), HULA does not incur any approximation error related to the posterior of the latent variables.Hence, HULA has the potential of faster convergence than exact MCMC methods, with a small loss in accuracy.

Subsampling
HULA generally requires a relatively small number of iterations, making it faster than exact MCMC.However, both methods have to generate from p(z|y, θ), which can be computationally costly when the number of latents in z is large.For certain types of models, the computational burden of HULA can be further reduced via the use of subsampling techniques.
For models where one can write p(z|θ) = n i=1 p(z i |θ), we have that which has a computational cost that increases linearly in n.Our method allows us to reduce the computational burden by considering the representation where A ∼ f (A) denotes a random subset of indexes from {1, . . ., n}, sampled without replacement.By introducing (12) into (9), we can show that a subsample unbiased estimate of G θ can be written as where A (s) ∼ f (A) and z (s)

Choice of τ and S
As mentioned earlier, larger values of τ lead to faster numerical convergence of the Markov chain, while smaller values lead to slow convergence.However, past a certain step size value, the chain becomes non-ergodic, and hence numerically unstable (Hodgkinson et al., 2021).Larger values of τ also increase the bias of the approximation (Vollmer et al., 2016).
In our empirical application, we find good performance with τ = 1 n , which guarantees smaller steps with more concentrated posterior distributions.
Another caveat to choosing the step size are differences in the magnitude of the marginal posterior variances across parameters.A scalar step size must be small enough to produce stable draws for the parameter with the smallest posterior variance (Girolami and Calderhead, 2011).This renders the chain to be inefficient for model parameters with large posterior variance, as it becomes slow at traversing the parameter space.To alleviate this, works such as Roberts and Stramer (2002) and Welling and Teh (2011) have suggested the use of a preconditioning matrix.We follow suit in this paper and re-express the ULA step as where U is a diagonal matrix, in which larger diagonal elements correspond to parameters expected to have a larger posterior variance.
Our empirical application demonstrates that by setting the number of draws to evaluate the gradient in HULA as S = 1, we can achieve an accurate approximation while keeping the computational costs low.Algorithm 1 outlines the HULA steps for generating from the approximation, which we denote as q τ (θ).

Algorithm 1 Hybrid Unadjusted Langevin Algorithm
1: Set a value for τ , U and initialize θ 1 2: for k = 1, . . ., K do 3: Draw z (k) ∼ p(z|y, θ k ) 4: The multinomial probit (MNP) model is an example of an econometric model for which MCMC sampling is time-consuming.Evaluation of the likelihood function involves analytical integration of a large number of latent variables, which is infeasible.Thus, MCMC samples from the augmented posterior distribution, which requires generation of all latent variables at each iteration and induces high autocorrelation in the Markov chain.In addition, the MNP specification we consider here requires the use of blocked random walk Metropolis-Hastings algorithms for sampling of the model parameters, which exacerbates the problem of high autocorrelation in the chain.
HULA speeds up estimation of the MNP model in three ways.First, because it uses the gradient of the logarithm of the augmented posterior density, its chain quickly converges towards regions of high posterior probability.Second, independent of the specification of the model, in HULA the parameters are generated all at once rather than in multiple blocks.Third, HULA allows for subsampling of the latents in each sample iteration, which leads to a substantial increase in computational efficiency.

Model specification
We observe a multinomial choice y i for individual i = 1, . . ., n, where y i = j if individual i chooses choice alternative j = 0, 1, . . ., J. Let z i = (z i1 , . . ., z iJ ) ⊤ be a J-dimensional vector of continuous random variables representing the latent utilities for the choice alternatives, which excludes the base-category latent utility z i0 = 0.
The multinomial outcome y i is determined by the maximum value of z i : where max(z i ) is the largest element of z i .The latent utilities corresponding to the choice alternatives are modeled as where X i is a J × r regressor matrix, β is an r-dimensional vector of coefficients, and ⊤ is a J-dimensional normally distributed disturbance vector with mean zero and covariance matrix Σ.We consider a factor structure for Σ.Define the J × p matrix B with p ≤ J and the J × J diagonal matrix D. We model Σ as The total number of parameters in B and D is J(p + 1).This implies that for a given value of p, the number of parameters grows linearly with J, instead of quadratically.Since the scale of the covariance matrix of the latent utilities is not identified in a multinomial probit model, we follow the approach of Loaiza-Maya and Nibbering (2022) and transform the elements of B and D into a spherical coordinate system.This system is parametrized by a (J(p + 1) − 1)-dimensional vector of angles κ, where the covariance matrix Σ = Σ(κ) is constructed from κ.
The augmented likelihood function of the model is given by where θ = (β ⊤ , κ ⊤ ) ⊤ , φ J (z i ; X i β, Σ(κ)) denotes the J-variate normal density with mean where I[A] is an indicator function that equals one if A is true and zero otherwise.
We use HULA to construct an approximation to this posterior, which requires an analytical expression for ▽ θ log p(y, z|X, θ)p(θ), a draw from the full conditional distribution of the latent utilities p(z|θ, y, X), and a choice for the preconditioning matrix U and the step size τ .First, Loaiza-Maya and Nibbering (2023) derives an expression for the gradient.Second, the latent utilities are sampled according to the Gibbs sampling algorithm proposed by Geweke (1991).Third, we set τ = 1 n , the diagonal elements of U corresponding to β equal 0.99 divided by the diagonal elements of the matrix 1 n n i=1 X ⊤ i Σ equi X i , and the diagonal elements of U corresponding to κ equal 0.1.
The HULA predictive probability mass function for y i is given by where X i denotes the attributes of the observation i to be predicted.An estimate pτ (y i |X i , y, X) for the predictive in ( 21) is constructed as the empirical probability mass implied by the draws y and {κ [k] } K k=1 denote the parameter draws from q τ (θ).

Pasta purchases
We fit the multinomial probit model to a data set on pasta brand purchases.This consumer choice data set includes one million purchases and is made available by the Dunnhumby data platform1 as "Carbo-Loading: A Relational Database".The final sample includes purchases without coupons of the ten top-selling pasta brands, excluding private labels, and contains 1,070,436 observations.We randomly allocate 80% of the observations for estimation of the model, and the remaining 20% are employed for outof-sample evaluation.The data set is described in detail by Loaiza-Maya and Nibbering (2023), who fit the multinomial probit model discussed above with an intercept and the log price for each brand with VB.We follow the same model specification and compare our method to exact MCMC and VB.

Computational efficiency
This section compares the computational efficiency of the proposed HULA method to that of the exact MCMC sampler for the multinomial probit model.First, per sample iteration, the computational cost of HULA is lower than that of MCMC: 1.64 versus 1.81 seconds, respectively2 .Both samplers generate at each iteration the latent utilities for all pasta brands and all observations from its full conditional distribution.MCMC samples the coefficients with a Gibbs step and the angles with a blocked random walk Metropolis-Hastings step.Instead, HULA produces a new of draw of all the parameters at once by evaluating the gradient of the augmented posterior, which is less time-consuming.Second, the HULA chain converges substantially faster than that of MCMC. Figure 1 shows the trace plot of the price coefficient and the first angle parameter of the first 200,000 iterations.For both parameters, the HULA chain converges well within the first 100,000 iterations.However, the MCMC chains have still not converged to the region of high posterior probability at 200,000 iterations.We find similar patterns in the chains for the other parameters: the 9 intercepts in β and the 16 other angles in κ.Based on the trace plots, we conclude that the HULA chain has converged after 100,000 iterations, and the MCMC chain after 1,000,000 iterations.
After discarding these burn-in iterations, we use the remaining iterations to construct our results.For HULA we collect 100,000 draws after the burn-in, and for MCMC 1,000,000 draws.
We assess the quality of the collected samples by the effective sample size calculated with an autocorrelation order of 1000 lags.Figure 2 shows the effective sample size per iteration of HULA relative to MCMC.The white, black and grey bars correspond to the intercept coefficients, price coefficient, and angle parameters, respectively.Since values larger than 1 favour HULA over MCMC, HULA obtains a higher effective sample size per iteration for all parameters.This result, in conjunction with the fact that HULA is also faster per iteration, and its chain reaches convergence in fewer iterations than MCMC, make it much more computationally efficient in this example.
The HULA approach takes in total 91 hours.This is still more computationally costly than the 15 hours of computation time required by VB.However, as we will show below, HULA is more accurate and the difference in computational costs can be reduced substantially by constructing the gradient using subsampling.

Posterior accuracy
In the MNP model, the parameter estimates themselves are hard to interpret and as such are not considered to be the key output from the model.Instead, in most empirical applications practitioners are interested in the implied choice probabilities of the alternatives.Thus, to assess the accuracy of the proposed HULA method, we compare their fitted posterior choice probabilities to those estimated by the exact MCMC method and the approximate VB method.
We find that HULA accurately estimates the posterior choice probabilities.Figure 3 shows the choice probability of buying the pasta brand 'Healthy Harvest' (Panel (a)) and 'De Cecco' (Panel (b)) as a function of their price, with the prices of the other pasta brands fixed at their mean.The solid yellow lines correspond to the posterior  probabilities of MCMC, the dashed black lines to HULA, and the dotted red lines to VB. HULA produces posterior probabilities that are almost identical to MCMC.We find the same for the purchase probabilities of the other 8 brands.However, Figure 3 shows that for 'Healthy Harvest' and 'De Cecco' VB is less accurate.We find that for the most popular pasta brands, the three methods have similar posterior purchase probabilities.

Predictive accuracy
HULA also attains similar predictive accuracy to MCMC.Table 1 shows the in-and out-of-sample log-scores and hit-rates of HULA, VB, and exact MCMC.Larger values for these metrics are preferred.The log-scores and hit-rates of HULA and MCMC are similar.However, although the differences are small, HULA performs better on all four metrics relative to Variational Bayes.

Hybrid Unadjusted Langevin with subsampling
The computational efficiency of HULA can be improved by constructing an unbiased estimate for the gradient at each iteration with a random subsample of observations, which implies that only a subsample of latent utilities must be generated.We run HULA with 20% subsampling, which takes 22 hours.This is a substantial reduction in computation time relative to HULA without subsampling (91 hours).Moreover, the computational cost is of the same order as that of VB (15 hours).
The loss in posterior accuracy due to subsampling is small.We find that the posterior probabilities of HULA with 20% subsampling are visually indistinguishable from those of HULA without subsampling, and are thus not included in Figure 3. Hence, both HULA with and without subsampling produce posterior purchase probabilities that are very close to the exact posterior purchase probabilities of MCMC.The predictive accuracy of the HULA with subsampling is also similar to HULA without subsampling.Table 1 shows that the in-and out-of-sample log-scores and hit-rates of HULA with 20% subsampling and MCMC are still alike, and that HULA with 20% subsampling still outperforms VB.
We conclude that the reductions in computation time due to subsampling are substantial, while the loss in accuracy is small.

Conclusion
This paper proposes an approximate Bayesian estimation method that is based on the unadjusted Langevin algorithm.This method is gaining popularity in the machine learning literature due to its simplicity and its convenient convergence properties.We show how the method can be employed to produce accurate estimation of latent variable models with large data sets, which is a challenging estimation problem in econometrics.
We refer to our method as the hybrid unadjusted Langevin algorithm (HULA).The approach targets the marginal posterior distribution of the model parameters, where an unbiased gradient estimate for the logarithm of the marginal posterior density is constructed via the use of the Fisher identity.The advantages of HULA are illustrated in a discrete choice application with big data: HULA produces fast convergence, negligible loss in posterior and predictive accuracy, and a substantial reduction in computational costs.

Figure 1 :
Figure 1: Trace plot across iterations of the HULA and exact MCMC chain.

Figure 2 :
Figure 2: Effective sample size per iteration of HULA relative to exact MCMC

Figure
Figure 3: Purchase probabilities for two pasta brands

Table 1
This table shows the in-and out-of-sample log-scores and hit-rates, defined in, for instance,Loaiza-Maya et al. (2022).The final row shows the total estimation time in hours.Predictive densities are estimated with VB, HULA with 20% subsampling, HULA with no subsampling, exact MCMC, and a naive method in which the forecast equals the most frequently observed category.