Evaluating the predictive abilities of mixed logit models with unobserved inter- and intra-individual heterogeneity

Mixed logit models with unobserved inter- and intra-individual heterogeneity hierarchically extend standard mixed logit models by allowing tastes to vary randomly both across individuals and across choice situations encountered by the same individual. Recent work advocates using these models in choice-based recommender systems under the premise that mixed logit models with unobserved inter- and intra-individual heterogeneity afford personalised preference estimation and prediction. In this study, we evaluate the ability of mixed logit with unobserved inter- and intra-individual heterogeneity to produce accurate individual-level predictions of choice behaviour. Using simulated and real data, we show that mixed logit models with unobserved inter- and intra-individual heterogeneity do not provide significant improvements in choice prediction accuracy over standard mixed logit models, which only account for inter-individual taste variation. We make these observations even in scenarios with high levels of intra-individual taste variation and when the number of choice situations per decision-maker is large. Also, the estimation of mixed logit with unobserved inter- and intra-individual heterogeneity requires at least seven times as much computation time as the estimation of standard mixed logit. Drawing from recent advances in machine learning and econometrics, we discuss alternative modelling approaches that can capture richer dependencies between decision-makers, alternatives and attributes.


Introduction
Mixed random utility models such as mixed logit (McFadden and Train, 2000) provide a powerful framework to account for unobserved taste heterogeneity in discrete choice models. When longitudinal choice data are analysed using mixed random utility models, it is standard practice to assume that tastes vary randomly across decision-makers but not across choice situations encountered by the same individual (Revelt and Train, 1998). The implicit assumption underlying this treatment of unobserved heterogeneity is that an individual's tastes are unique and stable (Stigler and Becker, 1977). However, contrasting views of preference formation postulate that preferences are constructed in an ad-hoc manner at the moment of choice (Bettman et al., 1998) or learnt and discovered through experience (Kivetz et al., 2008). From a behavioural perspective, these alternative views of preference formation justify accounting for both inter- and intra-individual random heterogeneity in discrete choice models (also see Hess and Giergiczny, 2015). A straightforward way to accommodate unobserved inter- and intra-individual heterogeneity in mixed random utility models is to augment a normal mixing distribution in a hierarchical fashion such that case-specific taste parameters are generated as normal perturbations around individual-specific taste parameters (see Becker et al., 2018, Bhat and Castelar, 2002, Bhat and Sardesai, 2006, Bhat and Sidharthan, 2011, Danaf et al., 2019, Hess and Giergiczny, 2015, Hess and Rose, 2009, Hess and Train, 2011, Xie et al., 2020, Yáñez et al., 2011). Originally, mixed logit models with unobserved inter- and intra-individual heterogeneity were primarily used as variance decomposition techniques in order to separate unobserved taste variation into inter- and intra-individual terms.
Yet, recent work advocates using these methods in choice-based recommender systems under the premise that mixed logit models with unobserved inter- and intra-individual heterogeneity afford personalised preference estimation and prediction (Danaf et al., 2019, Xie et al., 2020). These studies demonstrate that mixed logit models with unobserved inter- and intra-individual heterogeneity outperform standard logit models at out-of-sample prediction, both unconditionally (i.e. non-personalised inter-individual prediction for respondents without a history of past choices) and conditionally (i.e. personalised intra-individual prediction for respondents with a history of past choices). However, these studies do not draw comparisons with standard mixed logit models, which only account for inter-individual heterogeneity. Danaf et al. (2019) and Xie et al. (2020) contrast non-personalised (unconditional) and personalised (conditional) predicted choice probabilities of mixed logit with inter- and intra-individual heterogeneity. As expected, they conclude that personalisation improves the conditional prediction accuracy. However, unconditional choice probabilities of mixed logit with inter- and intra-individual heterogeneity are not the same as conditional choice probabilities of standard mixed logit models. In this research note, we evaluate the ability of mixed logit models with unobserved inter- and intra-individual heterogeneity to provide personalised predictions of choice behaviour. Using simulated and real data, we show that mixed logit models with unobserved inter- and intra-individual heterogeneity provide only marginal gains in terms of conditional predictions over simpler, computationally less expensive mixed logit models with only inter-individual heterogeneity.
In light of these findings and informed by recent advances at the intersection of machine learning and econometrics, we then discuss alternative approaches adopted in recommender systems to generate personalised predictions with random utility models. With the growing availability of dynamic panel data sets, recommender systems are increasingly deployed to improve user satisfaction and lower search costs by helping users navigate complex goods and service systems such as Internet marketplaces and smart mobility (Ansari et al., 2000, Lu et al., 2015). An example of a recommender system is a mobile local search-and-discovery application that provides users with personalised recommendations of places like restaurants based on user characteristics and previous visits (Kim, 2015). Accurate methods for personalised preference estimation and prediction lie at the heart of successful recommender systems (Ansari et al., 2000). Unlike standard recommendation methods such as collaborative and content-based filtering, discrete choice models can be employed even when the choice set is not persistent (Danaf et al., 2019). Consequently, there is a synergy between the methods adopted in recommender systems and discrete choice models, because the success of both approaches depends critically on the ability to capture rich dependencies between individuals, alternatives and attributes (Jiang et al., 2014). Several remarks about the focus of our contribution are in order: First, we emphasise that the main focus of our contribution is on evaluating the predictive abilities of mixed logit with unobserved inter- and intra-individual heterogeneity and not on understanding behaviour. At the same time, our analysis includes comparisons of maximum simulated likelihood and Bayes estimators for mixed logit models with unobserved inter- and intra-individual heterogeneity. These comparisons are also relevant to researchers who are mainly interested in using the model to explain behaviour.
Furthermore, we discuss several emerging approaches from the recommender systems literature which can be incorporated into the random utility maximisation framework in order to improve the conditional prediction accuracy of discrete choice models. This research direction is timely and relevant because the methods adopted in recommender systems offer flexible, parametric representations of the dependencies between individuals, alternatives and attributes. Until recently, these innovative models were expensive to estimate using standard methods due to their large parameter spaces. However, emerging approximate inference methods such as variational inference offer a drastic reduction of the computational burden associated with the estimation of such complex probabilistic models (Bansal et al., 2020, Hosseini et al., 2018). We organise the remainder of this research note as follows. First, we introduce mixed logit with unobserved inter- and intra-individual heterogeneity (Section 2). Next, we present the simulation study and the real data application (Sections 3 and 4). Then, we provide an extended discussion of alternative modelling approaches (Section 5), and finally, we conclude (Section 6).

Model formulation
Mixed logit with unobserved inter- and intra-individual heterogeneity (in particular Hess and Rose, 2009, Hess and Train, 2011) is established as follows: In choice situation t ∈ {1, . . . , T}, a decision-maker n ∈ {1, . . . , N} derives utility

U_{ntj} = V(X_{ntj}, \beta_{nt}) + \varepsilon_{ntj} \qquad (1)

from alternative j in the set C = {1, . . . , J}. Here, V(·) denotes the deterministic aspect of utility, X_ntj is a vector of covariates, β_nt is a collection of taste parameters, and ε_ntj is a stochastic disturbance. We obtain the logit model under the assumption that ε_ntj is independently and identically distributed according to Gumbel(0, 1) across decision-makers n, choice situations t and alternatives j.
Consequently, the probability that decision-maker n chooses alternative j ∈ C in choice situation t can be expressed as

P(y_{nt} = j \mid X_{nt}, \beta_{nt}) = \frac{\exp\{V(X_{ntj}, \beta_{nt})\}}{\sum_{j' \in C} \exp\{V(X_{ntj'}, \beta_{nt})\}}, \qquad (2)

where the random variable y_nt ∈ C indicates the chosen alternative. The distinguishing feature of mixed logit with unobserved inter- and intra-individual heterogeneity is that the taste parameters β_nt are case-specific. More specifically, β_nt is a normal perturbation around an individual-specific parameter µ_n, i.e. β_nt ∼ N(µ_n, Σ_W) for t = 1, . . . , T, where Σ_W is a full covariance matrix. The distribution of the individual-specific parameter µ_n is then also multivariate normal, i.e. µ_n ∼ N(ζ, Σ_B) for n = 1, . . . , N, where ζ is a mean vector and Σ_B is a full covariance matrix. In contradistinction, the standard panel estimator for mixed logit assumes taste homogeneity across replications, i.e. β_nt = β_n ∀t ∈ {1, . . . , T}, in order to capture inter-individual taste heterogeneity and to allow for dependence across repeated observations (Revelt and Train, 1998). The generally adopted labels inter- and intra-individual heterogeneity may falsely suggest that inferences are performed at the individual level. However, this is not the case. Compared to standard mixed logit, mixed logit with inter- and intra-individual heterogeneity has one more level to capture taste variation across choice situations. We learn about taste variation at the inter-individual level using information about differences across respondents in a longitudinal dataset. However, since Σ_W is generic, we learn about taste variation at the intra-individual level using information about differences across all choices from all respondents rather than across choices from only one respondent.
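The two-level heterogeneity structure described above can be sketched in a few lines of NumPy. The dimensions and parameter values below are illustrative placeholders, not quantities from this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: N decision-makers, T choice situations,
# J alternatives, K random taste parameters.
N, T, J, K = 5, 4, 3, 2

# Population-level parameters (illustrative values only).
zeta = np.array([-1.0, 0.5])       # mean vector
Sigma_B = 0.4 * np.eye(K)          # inter-individual covariance
Sigma_W = 0.2 * np.eye(K)          # intra-individual covariance

# mu_n ~ N(zeta, Sigma_B): individual-specific parameters.
mu = rng.multivariate_normal(zeta, Sigma_B, size=N)             # (N, K)

# beta_nt ~ N(mu_n, Sigma_W): case-specific parameters.
beta = np.stack([rng.multivariate_normal(mu[n], Sigma_W, size=T)
                 for n in range(N)])                            # (N, T, K)

# Logit choice probabilities for attributes X with linear-in-parameters V.
X = rng.uniform(0, 2, size=(N, T, J, K))
V = np.einsum('ntjk,ntk->ntj', X, beta)                         # utilities
P = np.exp(V) / np.exp(V).sum(axis=-1, keepdims=True)           # softmax over J
```

Drawing β_nt around µ_n, rather than fixing β_nt = β_n for all t, is the only change relative to a sketch of the standard panel mixed logit.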

Estimation
Mixed logit with unobserved inter- and intra-individual heterogeneity can be estimated using either classical maximum simulated likelihood (MSL) or Bayesian Markov chain Monte Carlo (MCMC) methods. In what follows, we describe both estimation approaches.

Maximum simulated likelihood (MSL)
In MSL estimation, the parameters θ = {ζ, Σ_B, Σ_W} are treated as fixed, unknown quantities. Point estimates of θ are obtained via maximisation of the unconditional log-likelihood, whereby the optimisation is in fact performed with respect to the Cholesky factors {L_B, L_W} of {Σ_B, Σ_W} in order to maintain positive-definiteness of the covariance matrices. Unlike in Bayesian estimation, the stochastic parameters µ_n and β_nt are not directly estimated, because they are integrated out in the simulation of the unconditional log-likelihood.
We then obtain the unconditional log-likelihood by marginalising out the stochastic parameters µ_n and β_nt. We have

LL(\theta) = \sum_{n=1}^{N} \log \int \left[ \prod_{t=1}^{T} \int P(y_{nt} \mid X_{nt}, \beta_{nt}) \, N(\beta_{nt} \mid \mu_n, \Sigma_W) \, d\beta_{nt} \right] N(\mu_n \mid \zeta, \Sigma_B) \, d\mu_n. \qquad (3)

Since the integrals in (3) are not analytically tractable, we resort to simulation to approximate the log-likelihood. The simulated log-likelihood is given by

SLL(\theta) = \sum_{n=1}^{N} \log \frac{1}{D} \sum_{d=1}^{D} \prod_{t=1}^{T} \frac{1}{R} \sum_{r=1}^{R} P(y_{nt} \mid X_{nt}, \beta_{nt,dr}), \qquad (4)

where we define β_{nt,dr} = ζ + L_B ξ_{n,d} + L_W ξ_{nt,r}. Here, ξ_{n,d} and ξ_{nt,r} denote standard normal simulation draws. For each decision-maker, we take D draws to marginalise out µ_n and R draws to marginalise out β_nt. A point estimate θ̂ is then given by

\hat{\theta} = \arg\max_{\theta} SLL(\theta). \qquad (5)

The optimisation problem defined in (5) can be solved using quasi-Newton methods, which exploit the gradient of the objective function to find a local optimum.
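The simulated log-likelihood with D inter-individual and R intra-individual draws can be sketched as follows. This is a minimal, unoptimised illustration; all names and dimensions are assumptions for the example, and a practical implementation would vectorise over individuals and use quasi-random draws:

```python
import numpy as np

def simulated_ll(zeta, L_B, L_W, X, y, D=50, R=50, rng=None):
    """Simulated log-likelihood of mixed logit with inter- and
    intra-individual heterogeneity (illustrative sketch).

    X: (N, T, J, K) attributes; y: (N, T) indices of chosen alternatives.
    """
    rng = rng or np.random.default_rng(0)
    N, T, J, K = X.shape
    ll = 0.0
    for n in range(N):
        xi_n = rng.standard_normal((D, K))           # inter-individual draws
        lik_d = np.ones(D)                           # per-draw panel likelihood
        for t in range(T):
            xi_nt = rng.standard_normal((R, K))      # intra-individual draws
            # beta[d, r] = zeta + L_B xi_n[d] + L_W xi_nt[r]
            beta = zeta + xi_n @ L_B.T               # (D, K)
            beta = beta[:, None, :] + xi_nt @ L_W.T  # (D, R, K)
            V = beta @ X[n, t].T                     # (D, R, J) utilities
            P = np.exp(V) / np.exp(V).sum(-1, keepdims=True)
            lik_d *= P[:, :, y[n, t]].mean(axis=1)   # average over intra draws
        ll += np.log(lik_d.mean())                   # average over inter draws
    return ll
```

Note that the intra-individual average is taken per choice situation before the product over t, matching the nesting of the sums in the simulated log-likelihood.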
Numerical gradient approximations are computationally expensive, as they incur many evaluations of the objective function. However, computation times of quasi-Newton methods can be drastically reduced if analytical gradients of the objective are provided. In the case of mixed logit with unobserved inter- and intra-individual heterogeneity, the two levels of integration in the approximation of the unconditional log-likelihood impose a substantial computational burden. Consequently, efficient optimisation routines are critical for moderating estimation times. In Appendix A.1, we present the analytical gradient of (4). After computing the point estimate θ̂ = {ζ̂, Σ̂_B, Σ̂_W}, we can obtain the posterior distribution of µ_n using Bayes' theorem (Revelt and Train, 2000, Train, 2009). We have

P(\mu_n \mid y_n, X_n, \hat{\theta}) = \frac{P(y_n \mid X_n, \mu_n, \hat{\Sigma}_W) \, N(\mu_n \mid \hat{\zeta}, \hat{\Sigma}_B)}{P(y_n \mid X_n, \hat{\theta})}. \qquad (6)

The mean of this posterior distribution is given by \bar{\mu}_n = \int \mu_n \, P(\mu_n \mid y_n, X_n, \hat{\theta}) \, d\mu_n. Consequently, we have

\bar{\mu}_n = \frac{\int \mu_n \, P(y_n \mid X_n, \mu_n, \hat{\Sigma}_W) \, N(\mu_n \mid \hat{\zeta}, \hat{\Sigma}_B) \, d\mu_n}{\int P(y_n \mid X_n, \mu_n, \hat{\Sigma}_W) \, N(\mu_n \mid \hat{\zeta}, \hat{\Sigma}_B) \, d\mu_n}. \qquad (7)

Since the integrals in (7) are not analytically tractable, we resort to simulation to approximate the posterior mean. The simulated posterior mean µ̂_n is given by

\hat{\mu}_n = \frac{\sum_{d=1}^{D} \mu_{n,d} \, w_{n,d}}{\sum_{d=1}^{D} w_{n,d}} \qquad (8)

with

\mu_{n,d} = \hat{\zeta} + \hat{L}_B \xi_{n,d}, \qquad w_{n,d} = \prod_{t=1}^{T} \frac{1}{R} \sum_{r=1}^{R} P(y_{nt} \mid X_{nt}, \mu_{n,d} + \hat{L}_W \xi_{nt,r}). \qquad (9)
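The simulated posterior mean of µ_n is a likelihood-weighted average of inter-individual draws. A minimal sketch, with all inputs assumed for illustration:

```python
import numpy as np

def posterior_mean_mu(zeta, L_B, L_W, X_n, y_n, D=200, R=50, rng=None):
    """Simulated posterior mean of the individual-specific parameter mu_n,
    conditional on individual n's observed choices (illustrative sketch).

    X_n: (T, J, K) attributes; y_n: (T,) indices of chosen alternatives.
    """
    rng = rng or np.random.default_rng(0)
    T, J, K = X_n.shape
    mu_d = zeta + rng.standard_normal((D, K)) @ L_B.T   # draws from N(zeta, Sigma_B)
    w = np.ones(D)                                      # likelihood weights
    for t in range(T):
        xi = rng.standard_normal((R, K))
        beta = mu_d[:, None, :] + xi @ L_W.T            # (D, R, K)
        V = beta @ X_n[t].T                             # (D, R, J)
        P = np.exp(V) / np.exp(V).sum(-1, keepdims=True)
        w *= P[:, :, y_n[t]].mean(axis=1)               # marginalise beta_nt
    return (w[:, None] * mu_d).sum(axis=0) / w.sum()    # weighted average
```

Draws that make the observed choice sequence likely receive large weights, so the weighted average shrinks from the population mean towards the individual's revealed tastes.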

Markov chain Monte Carlo (MCMC)
The goal of Bayesian estimation is to infer the posterior distribution of all model parameters {ζ, Σ_B, Σ_W, µ, β}. Thus, unlike in MSL estimation, posterior samples of µ_n and β_nt are directly obtained along with posterior samples of the other parameters. The Bayesian approach entails the specification of a full probability model for all parameters. Therefore, we also need to assign priors to {ζ, Σ_B, Σ_W}. We use a vague normal prior for ζ, i.e. ζ ∼ N(λ_0, Λ_0), and half-t priors for Σ_B and Σ_W. The latter are selected because of their superior non-informativity properties compared to alternative prior specifications for covariance matrices (Akinc and Vandebroek, 2018, Huang and Wand, 2013). The half-t prior is defined hierarchically: it consists of an inverse Wishart prior Σ ∼ IW(ν + K − 1, 2ν∆), where ν is a known hyper-parameter and K denotes the number of random parameters. ∆ = diag(δ_1, . . . , δ_K) is a diagonal matrix whose elements δ_k are distributed Gamma(1/2, a_k^{-2}). Stated succinctly, the full generative process of mixed logit with unobserved inter- and intra-individual heterogeneity is as follows:

δ_{B,k} ∼ Gamma(1/2, a_{B,k}^{-2}), k = 1, . . . , K, (10)
δ_{W,k} ∼ Gamma(1/2, a_{W,k}^{-2}), k = 1, . . . , K, (11)
Σ_B | δ_B ∼ IW(ν_B + K − 1, 2ν_B ∆_B), (12)
Σ_W | δ_W ∼ IW(ν_W + K − 1, 2ν_W ∆_W), (13)
ζ ∼ N(λ_0, Λ_0), (14)
µ_n | ζ, Σ_B ∼ N(ζ, Σ_B), n = 1, . . . , N, (15)
β_nt | µ_n, Σ_W ∼ N(µ_n, Σ_W), t = 1, . . . , T, (16)
y_nt | X_nt, β_nt distributed according to the logit probabilities (2), t = 1, . . . , T, (17)

where {λ_0, Λ_0, ν_B, ν_W, a_B, a_W} are known hyper-parameters, and θ = {δ_B, δ_W, Σ_B, Σ_W, ζ, µ, β} are the model parameters whose posterior distribution we wish to estimate. The generative process given in (10)-(17) implies the following joint distribution:

P(y, \theta) = \left[ \prod_{n=1}^{N} \left( \prod_{t=1}^{T} P(y_{nt} \mid X_{nt}, \beta_{nt}) \, N(\beta_{nt} \mid \mu_n, \Sigma_W) \right) N(\mu_n \mid \zeta, \Sigma_B) \right] N(\zeta \mid \lambda_0, \Lambda_0) \, IW(\Sigma_B \mid \nu_B + K − 1, 2\nu_B \Delta_B) \, IW(\Sigma_W \mid \nu_W + K − 1, 2\nu_W \Delta_W) \prod_{k=1}^{K} Gamma(\delta_{B,k} \mid 1/2, r_{B,k}) \, Gamma(\delta_{W,k} \mid 1/2, r_{W,k}), \qquad (18)

where r_{B,k} = a_{B,k}^{-2} and r_{W,k} = a_{W,k}^{-2}. By Bayes' rule, the posterior distribution of interest is given by P(θ|y) = P(y, θ) / ∫ P(y, θ) dθ ∝ P(y, θ).
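To make the hierarchical half-t prior concrete, the sketch below draws a covariance matrix from it using SciPy's inverse Wishart. The function name and default hyper-parameters are assumptions of this sketch, as is the reading of the Gamma distribution's second parameter as a rate:

```python
import numpy as np
from scipy.stats import invwishart

def sample_half_t_prior(K, nu=2.0, a=1.0, rng=None):
    """Draw a covariance matrix from the hierarchical half-t prior
    (Huang and Wand, 2013): delta_k ~ Gamma(1/2, rate = 1/a^2), then
    Sigma ~ IW(nu + K - 1, 2 * nu * diag(delta)). Illustrative sketch."""
    rng = rng or np.random.default_rng(0)
    # numpy's Generator.gamma is parameterised by shape and scale, so the
    # rate 1/a^2 becomes the scale a^2.
    delta = rng.gamma(shape=0.5, scale=a ** 2, size=K)
    return invwishart.rvs(df=nu + K - 1, scale=2 * nu * np.diag(delta),
                          random_state=rng)
```

Sampling the prior a few thousand times is a quick way to check that the implied marginal standard deviations are as weakly informative as intended before running the full Gibbs sampler.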
Exact inference of this posterior distribution is not possible, because the model evidence ∫ P(y, θ) dθ is not analytically tractable. Hence, we resort to approximate inference.
The central idea of MCMC is to approximate a posterior distribution through samples from a Markov chain whose stationary distribution is the target distribution.
Gibbs sampling constructs such a Markov chain by iteratively sampling from the conditional posterior distributions of blocks of model parameters. Becker et al. (2018) devise a Gibbs sampler for posterior inference in mixed logit with unobserved inter- and intra-individual heterogeneity. In Appendix A.2, we present one iteration of the sampler. A key feature of the algorithm is that posterior samples of µ_n are directly obtained.

Simulation study
In this section, we present an extensive simulation evaluation of mixed logit with unobserved inter- and intra-individual heterogeneity. The model is benchmarked against simpler conditional and mixed logit models in terms of estimation time, estimation accuracy and out-of-sample predictive accuracy. In addition, we compare the performance of the MSL and MCMC estimators for mixed logit with unobserved inter- and intra-individual heterogeneity.

Data and experimental setup
For the simulation study, we rely on synthetic choice data, which we generate as follows: The choice sets comprise three unlabelled alternatives, which are characterised by four attributes. Decision-makers are assumed to be utility maximisers and to evaluate the alternatives based on the utility specification

U_{ntj} = X_{ntj}^{\top} \beta_{nt} + \varepsilon_{ntj}, \qquad \varepsilon_{ntj} \sim Gumbel(0, 1).

For the generation of the taste parameters β_nt, we consider two scenarios, in which the proportion of the total variance that is due to intra-individual taste heterogeneity is varied. In the two scenarios, β_nt is drawn via the following process:

\mu_n \sim N(\zeta, \Omega_B), \qquad \beta_{nt} \sim N(\mu_n, \Omega_W),

where Ω_B and Ω_W are covariance matrices with diagonal elements σ²_B and σ²_W. The assumed values of ζ, Ω_B and Ω_W are enumerated in Appendix B. We define σ²_B = 2 · (1 − α) · |ζ| and σ²_W = 2 · α · |ζ| with α ∈ [0, 1], i.e. the total variance of each random parameter is twice the absolute value of its mean, and a proportion α of the total variance is due to intra-individual taste variation. In scenario 1, we set α = 0.3, and in scenario 2, we set α = 0.7. In both scenarios, the alternative-specific attributes X_ntj are drawn from Uniform(0, 2), which implies an error rate of approximately 20%, i.e. in one fifth of the cases decision-makers deviate from the systematically best alternative due to the stochastic utility component. In both scenarios, we further set N = 1000 and let T take a value in {10, 20}. For each experimental scenario and for each value of T, we consider 20 replications, whereby the data for each replication are generated using a different random seed.
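The data-generating process can be sketched as follows. The values of ζ are placeholders rather than the ones enumerated in Appendix B, and the covariance matrices are taken to be diagonal for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the data-generating process.
N, T, J = 1000, 10, 3
zeta = np.array([-0.8, 0.8, 1.0, -1.0])      # placeholder means
K = zeta.size
alpha = 0.3                                  # intra-individual variance share

sigma2_B = 2 * (1 - alpha) * np.abs(zeta)    # inter-individual variances
sigma2_W = 2 * alpha * np.abs(zeta)          # intra-individual variances

# mu_n ~ N(zeta, diag(sigma2_B)); beta_nt ~ N(mu_n, diag(sigma2_W)).
mu = zeta + rng.standard_normal((N, K)) * np.sqrt(sigma2_B)
beta = mu[:, None, :] + rng.standard_normal((N, T, K)) * np.sqrt(sigma2_W)

# Utility-maximising choices under Gumbel(0, 1) disturbances.
X = rng.uniform(0, 2, size=(N, T, J, K))
U = np.einsum('ntjk,ntk->ntj', X, beta) + rng.gumbel(size=(N, T, J))
y = U.argmax(axis=-1)
```

By construction, the pooled variance of each taste parameter is σ²_B + σ²_W = 2·|ζ|, with a share α of it attributable to variation across choice situations within the same individual.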

Accuracy assessment
We evaluate the accuracy of the estimation approaches in terms of their ability to recover parameters in finite samples and their out-of-sample predictive accuracy.

Parameter recovery
To assess how well the estimation approaches perform at recovering parameters, we calculate the root mean square error (RMSE) for selected parameters, namely the mean vector ζ and the unique elements {Σ_{B,U}, Σ_{W,U}} of the covariance matrices {Σ_B, Σ_W}. Given a collection of parameters θ and its estimate θ̂, RMSE is defined as

RMSE(\theta, \hat{\theta}) = \sqrt{\frac{1}{M} \sum_{m=1}^{M} (\hat{\theta}_m − \theta_m)^2},

where M denotes the total number of scalar parameters collected in θ. For MSL, point estimates of ζ, Σ_B and Σ_W are directly obtained. For MCMC, estimates of the parameters of interest are given by the means of the respective posterior draws. As our aim is to evaluate how well the estimation methods perform at recovering the distributions of the realised individual- and observation-specific parameters {µ, β}, we use the sample mean \bar{\zeta} = \frac{1}{N} \sum_{n=1}^{N} \mu_n and the sample covariances \bar{\Sigma}_B = \frac{1}{N} \sum_{n=1}^{N} (\mu_n − \bar{\zeta})(\mu_n − \bar{\zeta})^{\top} and \bar{\Sigma}_W = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} (\beta_{nt} − \mu_n)(\beta_{nt} − \mu_n)^{\top} as the reference values in the RMSE calculation.
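A small sketch of the RMSE criterion and the sample moments used as recovery targets; all names are illustrative:

```python
import numpy as np

def rmse(theta_true, theta_hat):
    """Root mean square error over the M scalar parameters in theta."""
    theta_true, theta_hat = np.ravel(theta_true), np.ravel(theta_hat)
    return np.sqrt(np.mean((theta_hat - theta_true) ** 2))

def sample_moments(mu):
    """Sample mean and covariance of realised individual-specific
    parameters mu of shape (N, K), used as recovery targets."""
    zeta_bar = mu.mean(axis=0)
    centred = mu - zeta_bar
    Sigma_B_bar = centred.T @ centred / mu.shape[0]
    return zeta_bar, Sigma_B_bar
```

Comparing estimates against the realised sample moments, rather than the population values, removes the sampling noise of a single replication from the recovery assessment.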

Predictive accuracy
We consider two out-of-sample prediction scenarios. In the first scenario, we predict choice probabilities for a new set of individuals without a history of past choices, i.e. we predict unconditionally on an individual's past choices. To that end, we generate a test set consisting of 100 observations from 100 new individuals along with each training sample. The realised choices and attributes of this sample are denoted by y*_nt and X*_nt. In the second scenario, we predict choice probabilities for new choice sets for individuals who are already in the training sample and thus have a record of past choices, i.e. we predict conditionally on an individual's past choices. To that end, we create another test set by generating additional choice sets for 100 individuals from the training sample. The realised choices and attributes of this sample are denoted by y†_nt and X†_nt. We let T = 1 in both validation samples. For mixed logit with unobserved inter- and intra-individual heterogeneity, the estimated predicted choice probabilities for the unconditional prediction scenario are given by

\hat{P}(y^{*}_{nt} = j \mid X^{*}_{nt}) = \int \int P(y^{*}_{nt} = j \mid X^{*}_{nt}, \beta) \, N(\beta \mid \mu, \hat{\Sigma}_W) \, N(\mu \mid \hat{\zeta}, \hat{\Sigma}_B) \, d\beta \, d\mu, \qquad (23)

where ζ̂, Σ̂_B and Σ̂_W denote the posterior means of ζ, Σ_B and Σ_W, respectively. The estimated predicted choice probabilities for the conditional prediction scenario are given by

\hat{P}(y^{†}_{nt} = j \mid X^{†}_{nt}) = \int P(y^{†}_{nt} = j \mid X^{†}_{nt}, \beta) \, N(\beta \mid \hat{\mu}_n, \hat{\Sigma}_W) \, d\beta, \qquad (24)

where µ̂_n and Σ̂_W denote the posterior means of µ_n and Σ_W, respectively. Recall that in MCMC, the posterior distribution of µ_n is directly estimated, whereas for MSL, we obtain µ̂_n using (8). Expressions for the estimated predicted choice probabilities for standard logit and mixed logit with only inter-individual heterogeneity can be obtained by omitting levels of integration from (23) and (24). For each of the two prediction scenarios, we calculate Brier scores (Brier, 1950) with respect to the realised choices and the predicted choice probabilities.
The Brier score (BS) of a test set is given by

BS = \frac{1}{N^{test}} \sum_{n,t} \sum_{j \in C} \left( 1\{y_{nt} = j\} − \hat{P}_{ntj} \right)^2,

where 1{y_nt = j} is an indicator that equals one if the condition inside the braces is true and zero otherwise, N^test denotes the number of observations in the test set, and P̂_ntj is a shorthand notation for the predicted probability that y_nt = j is observed. A lower Brier score indicates superior predictive accuracy. The Brier score is a strictly proper scoring rule, since it is exclusively minimised by the true predictive choice probabilities (Gneiting and Raftery, 2007). An important feature of the Brier score is that it takes into account the predicted choice probabilities of whole choice sets. Danaf et al. (2019) and Xie et al. (2020) use the average of the predicted probabilities of only the chosen alternatives (henceforth, P_chosen) to evaluate predictive accuracy, with the interpretation that a higher value of P_chosen indicates superior predictive performance. In what follows, we report both Brier scores and P_chosen.
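Both evaluation metrics are straightforward to compute from a matrix of predicted choice probabilities; a minimal sketch:

```python
import numpy as np

def brier_score(P, y):
    """Multiclass Brier score: P has shape (N_obs, J) of predicted
    choice probabilities, y the indices of the chosen alternatives."""
    onehot = np.eye(P.shape[1])[y]                  # 1{y_nt = j}
    return np.mean(np.sum((onehot - P) ** 2, axis=1))

def p_chosen(P, y):
    """Average predicted probability of the chosen alternatives."""
    return P[np.arange(len(y)), y].mean()
```

A perfect forecast attains a Brier score of zero, while a uniform forecast over J = 3 alternatives scores 2/3 regardless of the realised choices, which illustrates why the score rewards sharp, well-calibrated probabilities.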

Implementation details
We implement the MSL and MCMC estimators in Python. For MSL, the numerical optimisations are performed using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Nocedal and Wright, 2006) contained in Python's SciPy library (Jones et al., 2001). Analytical gradients are provided (see Appendix A.1). The Hessian matrix of the simulated log-likelihood function is calculated as a finite difference approximation of the Jacobian of the analytical gradient. We use 250 inter-individual simulation draws per decision-maker and 250 intra-individual simulation draws per observation. To establish that the estimation results are stable for these numbers of draws, we also evaluate the performance of the MSL estimator using 500 × 500 draws. The simulation draws are generated using the Modified Latin Hypercube Sampling (MLHS) approach (Hess et al., 2006). We also take advantage of Python's parallel processing capabilities to improve the computational efficiency of the MSL estimator. We process the likelihood computations in ten parallel batches, each of which corresponds to 25 (50) inter-individual simulation draws. The MCMC sampler for mixed logit with unobserved inter- and intra-individual heterogeneity is executed with two parallel Markov chains and 400,000 iterations for each chain, whereby the initial 200,000 iterations of each chain are discarded for burn-in. After burn-in, every tenth draw is retained to moderate storage requirements and to facilitate post-simulation computations. For standard logit and mixed logit with only inter-individual heterogeneity, the MCMC samplers are executed with two parallel Markov chains and 100,000 iterations for each chain, whereby the initial 50,000 iterations of each chain are discarded for burn-in. After burn-in, every fifth draw is kept. Table 1 compares the predictive accuracy of the models.
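For completeness, a minimal sketch of MLHS draw generation as we understand it from Hess et al. (2006): one evenly spaced, randomly shifted and independently shuffled sequence per dimension, pushed through the inverse normal CDF. The function name is an assumption of this sketch:

```python
import numpy as np
from scipy.stats import norm

def mlhs_draws(n_draws, n_dims, rng=None):
    """Modified Latin Hypercube Sampling: stratified uniforms per
    dimension, mapped to standard normal draws (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    u = np.empty((n_draws, n_dims))
    for k in range(n_dims):
        # Evenly spaced points with one random shift per dimension,
        # then shuffled to break the ordering across dimensions.
        seq = (np.arange(n_draws) + rng.uniform()) / n_draws
        u[:, k] = rng.permutation(seq)
    return norm.ppf(u)
```

Because each dimension's draws are stratified over the unit interval, the empirical moments of the resulting normal draws converge faster than with pseudo-random sampling, which is why fewer draws suffice for a given simulation accuracy.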
For each value of α and number of choice situations per individual T, we report the means and the standard errors of the Brier scores as well as the average predicted probabilities of the chosen alternative (P_chosen) for the unconditional and the conditional prediction scenarios across 20 resamples. In our subsequent discussion, we focus on the Brier score, as it is strictly proper. Nonetheless, P_chosen leads to the same general conclusions. Across the different experimental scenarios, we do not observe significant differences in unconditional predictive accuracy between the considered methods. As expected, standard logit without individual-specific parameters yields the same level of predictive accuracy in the unconditional and the conditional prediction scenarios. However, due to the presence of individual-specific parameters, mixed logit provides better conditional predictive accuracy than standard logit. For instance, in scenario 1 with α = 0.3 for T = 20, MNL produces an average Brier score of 0.200. Mixed logit with only inter-individual heterogeneity produces an average Brier score of 0.152, whereas mixed logit with unobserved inter- and intra-individual heterogeneity estimated via MCMC and MSL with 250 × 250 draws give average Brier scores of 0.149 and 0.150, respectively. We further observe that the conditional predictive accuracy of mixed logit improves relative to standard logit, as more choice situations are included in the estimation. For example, in scenario 1 with α = 0.3, the Brier score of mixed logit with only inter-individual heterogeneity is 0.165 for T = 10, while it is 0.152 for T = 20. Interestingly, mixed logit with unobserved inter- and intra-individual heterogeneity does not provide significantly better conditional predictive accuracy than standard mixed logit in any of the considered experimental scenarios. The difference in Brier scores of the two methods is at most 0.003.
Also, the proportion of variance α that is due to intra-individual taste variation does not appear to affect the conditional predictive accuracy of the considered mixed logit models. For example, for T = 20, the average Brier score for the conditional prediction scenario of mixed logit with unobserved inter- and intra-individual heterogeneity estimated via MCMC is 0.149 in scenario 1 (α = 0.3) and 0.150 in scenario 2 (α = 0.7). Even in scenario 2, in which intra-individual taste variation accounts for 70% of the total variation in tastes, mixed logit with unobserved inter- and intra-individual heterogeneity does not outperform simple mixed logit with only inter-individual heterogeneity.

Results
Another insight from Table 1 is that the MCMC estimator and the two configurations of the MSL estimator for mixed logit with unobserved inter- and intra-individual heterogeneity perform equally well in the considered prediction scenarios. Table 2 contrasts the estimation accuracy of the MCMC estimator and the two configurations of the MSL estimator for mixed logit with unobserved inter- and intra-individual heterogeneity. We find that the three methods perform equally well at recovering parameters. We further observe negligible differences between MSL with 250 × 250 draws and MSL with 500 × 500 draws. Finally, Table 3 gives the estimation times of the different methods across the considered experimental scenarios. Mixed logit with only inter-individual heterogeneity is substantially faster than mixed logit with unobserved inter- and intra-individual heterogeneity. In all of the considered simulation scenarios, MSL with analytical gradients and 250 × 250 simulation draws is faster than MCMC. For example, in scenario 1 for T = 10, the average estimation time of simple mixed logit is 285 seconds, while the average computation times of mixed logit with unobserved inter- and intra-individual heterogeneity estimated via MCMC and MSL with 250 × 250 simulation draws are approximately tenfold with 3,423 seconds and 2,951 seconds, respectively. In the considered scenarios, MSL with 250 × 250 draws is approximately four times faster than MSL with 500 × 500 draws. We also observe that the standard errors of the estimation times across the 20 resamples are proportionally higher for MSL than for MCMC. Inherent differences between the estimation algorithms explain this discrepancy. Whereas MCMC simulations are run for a fixed number of iterations, the number of function evaluations that is needed to reach convergence during the maximisation of the simulated log-likelihood is not fixed and depends on initial values and the shape of the log-likelihood surface.
In sum, the simulation study shows that none of the mixed logit models provide substantially better unconditional predictive accuracy than standard logit. Nonetheless, the considered mixed logit models offer superior conditional predictive accuracy. Yet, there are no substantive differences between standard mixed logit and a more complex mixed logit accounting for both inter- and intra-individual heterogeneity. Besides, we observe that MCMC and MSL with 250 × 250 draws are equivalent in terms of prediction and estimation accuracy. MSL with 250 × 250 draws is approximately four times faster than MSL with 500 × 500 draws, while offering equivalent estimation and prediction accuracy.

Real data application
In this section, we evaluate the performance of mixed logit with unobserved inter- and intra-individual heterogeneity using real data.

Data, utility specification and implementation details
Data for the empirical application are sourced from a stated preference survey about mobility on-demand in New York City (Liu and Daziano, 2018, Liu et al., 2018). The data include observations from 1,507 respondents who each completed seven choice situations derived from a pivot-efficient design. Each choice situation included three labelled alternatives, namely Uber (without pooling), UberPool (with pooling) and the current mode. The alternatives are described by six attributes, namely out-of-vehicle travel time (OVTT), in-vehicle travel time (IVTT), trip cost, parking cost, the powertrain of the vehicle (gas/petrol or electric) and the automation level of the vehicle (with or without driver). Figure 1 shows an example of a choice situation. Mixed logit with unobserved inter- and intra-individual heterogeneity assumes a utility specification of the following form:

U_{ntj} = X^{random \top}_{ntj} \beta_{nt} + X^{fixed \top}_{ntj} \gamma + \varepsilon_{ntj}.

Here, X^random_ntj is a vector of attributes with individual- and observation-specific random taste parameters β_nt, and X^fixed_ntj is a vector of attributes with fixed taste parameters γ. ε_ntj is a stochastic disturbance with distribution Gumbel(0, 1). We performed an extensive specification search to determine which attributes to associate with either random or fixed parameters. During the specification search, we monitored model tractability and the inferred amounts of intra-individual taste variation. In the final model specification, we include three random parameters and four fixed parameters in the model. The random parameters pertain to OVTT, IVTT and a dummy variable indicating whether the hypothetical mobility on-demand vehicle is automated. The fixed parameters pertain to two alternative-specific constants for the hypothetical mobility on-demand alternatives, a dummy variable indicating whether the hypothetical mobility on-demand vehicle is electric and the total trip cost, which subsumes the trip cost and the parking cost.
The dummy variables are effects-coded to reduce the scale of the associated parameters, with negative one indicating that the feature is absent and positive one indicating that it is present. OVTT and IVTT are divided by ten to increase the scale of the associated parameters. The utilities of standard logit and mixed logit with only inter-individual heterogeneity are specified analogously. We consider two configurations of the training and the test data to evaluate the influence of including different numbers of choice situations per individual in the training data on the predictive accuracy. In the first configuration, the training set includes four randomly selected choice situations from 1,407 randomly selected respondents. One test set is used to evaluate the unconditional predictive ability of the considered models. It includes one choice situation from each of the remaining 100 respondents. A second test set is used to evaluate the conditional predictive ability. It is formed by randomly selecting one of the remaining choice situations from the 1,407 respondents included in the training sample. In the second configuration, the training set includes six randomly selected choice situations from 1,407 randomly selected respondents. The test sets are created in the same way as in the first configuration. For each configuration, we create ten random splits into training and test sets. We then compare the performance of the different choice models across the splits. The MCMC methods are estimated in the same way as described in Section 3.3. For MSL, we use 250 inter-individual simulation draws per decision-maker and 250 intra-individual simulation draws per observation. We also tested the MSL method with 500 × 500 draws but found no differences in prediction accuracy and estimation results. Table 4 contrasts the predictive accuracy of the considered models.
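The effects coding and scaling described above amount to simple element-wise transforms; a minimal sketch with made-up attribute values:

```python
import numpy as np

def effects_code(x):
    """Effects-code a binary feature: +1 if present, -1 if absent."""
    return np.where(np.asarray(x, dtype=bool), 1.0, -1.0)

# Illustrative preprocessing mirroring the text: travel times divided
# by ten, dummies effects-coded. Values below are made up.
ovtt_min = np.array([5.0, 12.0, 30.0])   # hypothetical OVTT in minutes
automated = np.array([1, 0, 1])          # hypothetical automation dummy

ovtt_scaled = ovtt_min / 10.0
automated_ec = effects_code(automated)
```

With effects coding, the alternative-specific constant absorbs the average of the two feature levels, so the coefficient measures the symmetric deviation from that average rather than the jump from a zero baseline.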
For each configuration of the training data with T = 4 or T = 6 and for each model, we report the means and the standard errors of the Brier scores and of the average predicted choice probabilities of the chosen alternative for the unconditional and the conditional prediction scenarios across ten random splits of the training data. Overall, the results are consistent with the results of the simulation study. We do not observe any noteworthy differences in unconditional predictive accuracy across methods and configurations of the training data. Both mixed logit models offer better conditional prediction accuracy than the standard multinomial logit model. The more complex mixed logit model with unobserved inter- and intra-individual heterogeneity does not provide benefits over standard mixed logit in terms of conditional prediction accuracy. For both types of mixed logit, the conditional prediction accuracy increases as more choice situations are included in the training data. For example, for T = 4, standard mixed logit produces an average Brier score of 0.154 in the conditional prediction scenario, while the same model yields an average Brier score of 0.147 for T = 6. In both configurations of the training data, the MCMC and MSL estimators for mixed logit with unobserved inter- and intra-individual heterogeneity are equally accurate in the two prediction scenarios.

[Table 4: Predictive accuracy on real data. Note: The reported values are averages and standard errors across ten random splits. T = observations per individual. Brier = Brier score. P chosen = average predicted probability of the chosen alternative. UC = unconditional prediction; C = conditional prediction.]

Table 5 enumerates the detailed estimation results for one of the random splits of the stated choice data with six choice situations per individual. For MCMC, we report the posterior means, the posterior standard deviations and the bounds of the 95% credible intervals.
For MSL, we report the point estimates, the asymptotic standard errors and the bounds of the 95% confidence intervals. Recall that for MSL, the maximisation of the simulated log-likelihood is performed with respect to the Cholesky factors of the covariance matrices of the heterogeneity distributions. Thus, the reported estimates of the covariance elements are derived from the point estimates of the Cholesky factors. Standard errors of the covariance elements are obtained using a parametric bootstrap with 10,000 draws. Our first observation is that in the majority of cases, the fixed taste parameters and the means of the random taste parameters have the same signs in the four models. All four models suggest that the fixed taste parameter with respect to vehicle electrification is not statistically different from zero. In the case of the mean of the random taste parameter pertaining to OVTT, we observe that the parameter is negative and statistically different from zero in standard mixed logit, while it is not statistically different from zero in both classical and Bayesian mixed logit with unobserved inter- and intra-individual heterogeneity. Furthermore, we find that the MSL and MCMC estimates, including the credible (confidence) intervals, of mixed logit with unobserved inter- and intra-individual heterogeneity exhibit a close correspondence. Due to its ability to decompose taste variation into inter- and intra-individual components, mixed logit with unobserved inter- and intra-individual heterogeneity offers interesting behavioural insights into the sources of taste variation. We find evidence of substantial intra-individual taste variation. For example, MCMC suggests that 2.169/(3.782 + 2.169) = 36.4% of the variation in tastes with respect to OVTT is due to intra-individual heterogeneity. Similarly, MSL indicates that 0.644/(1.260 + 0.644) = 33.8% of the variation in tastes with respect to vehicle automation can be ascribed to intra-individual heterogeneity.
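The variance decomposition behind these percentages is a simple ratio of the intra-individual variance to the total variance, assuming independent inter- and intra-individual components; a minimal sketch reproducing the figures quoted above:

```python
def intra_share(var_inter, var_intra):
    """Share of total taste variance attributable to intra-individual
    heterogeneity, assuming independent inter- and intra-level components."""
    return var_intra / (var_inter + var_intra)

# Variance estimates quoted in the text:
ovtt_share = intra_share(3.782, 2.169)   # MCMC, OVTT -> about 0.364
auto_share = intra_share(1.260, 0.644)   # MSL, vehicle automation -> about 0.338
```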
Both MCMC and MSL suggest that all off-diagonal elements of the inter-individual covariance matrix of mixed logit with unobserved inter- and intra-individual heterogeneity are statistically different from zero. Also, both MCMC and MSL suggest that, with the exception of the covariance between the random parameters pertaining to vehicle automation and IVTT, all off-diagonal elements of the intra-individual covariance matrix are statistically different from zero. Finally, Table 6 gives the estimation times of the models across the ten random splits for the two configurations of the training data. We observe that mixed logit with only inter-individual heterogeneity is estimated substantially faster than mixed logit with unobserved inter- and intra-individual heterogeneity. For example, for T = 4, the average estimation time of standard mixed logit estimated via MCMC is approximately seven times lower than the average estimation time of mixed logit with unobserved inter- and intra-individual heterogeneity estimated via MCMC, and approximately eleven times lower than that of the same model estimated via MSL.
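For reference, the Brier scores reported above correspond to the multiclass quadratic scoring rule, which can be computed as in the following generic sketch (not the authors' code; the exact averaging convention used in the paper may differ):

```python
import numpy as np

def brier_score(probs, chosen):
    """Multiclass Brier score: mean squared distance between each predicted
    probability vector and the one-hot indicator of the chosen alternative."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(chosen)), chosen] = 1.0
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# A confident, correct prediction scores 0; a uniform prediction over three
# alternatives scores (1 - 1/3)**2 + 2 * (1/3)**2 = 2/3.
```

Lower values therefore indicate better predictive accuracy.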

[Table 5: Detailed estimation results for one random split of the stated choice data (T = 6). Note: The dummy variables electric and automated are effects-coded, with negative one indicating that the feature is absent and positive one indicating that the feature is present, to reduce the scale of the associated parameters. OVTT and IVTT are divided by ten to increase the scale of the associated parameters.]

Extended discussion
Our analysis suggests that mixed logit models with unobserved inter- and intra-individual heterogeneity do not provide significant improvements in conditional prediction accuracy over simpler mixed logit models which only account for unobserved inter-individual heterogeneity. The inability of the former to outperform the latter can be ascribed to the former's predominant emphasis on non-structural random heterogeneity. Thus, there is a need to explore alternative modelling approaches which have the potential to provide accurate individualised predictions of choice behaviour by accounting for richer dependencies between products and consumers' preferences as well as temporal correlations between choices in a flexible framework. In what follows, we discuss four strands of the literature and evaluate their relevance for creating choice-based recommender systems within the random utility maximisation (RUM) framework.

Collaborative filtering
Various collaborative filtering approaches such as matrix factorisation have emerged as powerful tools to generate personalised recommendations in recommender systems (Gopalan et al., 2013, Koren et al., 2009, Mnih and Salakhutdinov, 2008). The fundamental idea of collaborative filtering is to predict a consumer's preferences by exploiting interdependencies between products. Matrix factorisation provides a mapping of consumers and products into a joint latent factor space. It operates on a sparse matrix of dimension # of consumers × # of products, in which each cell represents one consumer's preference for a specific product. The consumer's preference is represented as the inner product of a latent vector of product characteristics and a latent vector of consumer preferences for each of the latent product characteristics (see Gopalan et al., 2013, for details of the formulation). Learning the latent factors underlying such a sparse matrix is computationally challenging, but advancements in variational Bayes have made the estimation of these models tractable for large data sets. Recent studies on matrix factorisation methods also account for dynamic consumer preferences and social network effects (Hosseini et al., 2018). The combination of scalability, the ability to account for dynamics and social aspects, and superior predictive accuracy has made matrix factorisation methods popular in industrial applications. However, they have received limited attention from the applied econometrics and marketing communities due to their i) predominant focus on prediction rather than inference, ii) apparent disconnect from economic theory, and iii) inability to model time-varying choice sets and product-specific attributes. Economists and machine learning researchers have recently joined forces to address the second and third limitations of this powerful tool. Athey et al.
(2018) illustrate how matrix factorisation methods can be integrated into standard RUM frameworks to predict an individual's choice of restaurants using data from a local search-and-discovery application. The main idea of the approach is to augment the original utility equation with consumer- and product-level covariates by including a vector of latent characteristics for each restaurant as well as latent preferences of consumers for these characteristics. Thus, the framework incorporates the key component (i.e. the sparse latent construct) of standard matrix factorisation models into the RUM framework and adopts variational Bayes for scalable estimation and prediction. In another study, Donnelly et al. (2019) use a similar framework to model consumer preferences across multiple categories of products in a supermarket. These theory-driven advancements will hopefully convince applied choice modellers of the benefits of matrix factorisation methods for personalised predictions. Zhu et al. (2020) propose a choice model with time-varying parameters in a collaborative learning framework. Similar to latent class models, this model assumes that there are several unique underlying preference patterns (i.e. classes); but rather than assigning each consumer to one class and assuming the preferences of all class members to be the same, a vector of weights (a membership vector) is specified to represent the degree of resemblance of the consumer's preferences to each preference pattern. Temporal variation in these unique preference patterns is captured by time-varying model parameters. The framework is already a viable alternative to mixed logit with inter- and intra-individual heterogeneity. Yet, it can be further improved by taking inspiration from Athey et al. (2018) and by incorporating the latent structure of matrix factorisation into the utility equation.
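To make the mechanics of matrix factorisation concrete, the following minimal sketch fits latent consumer and product factors to a toy set of observed cells by stochastic gradient descent; the data, hyperparameters and function name are illustrative and do not correspond to any of the cited studies (which use variational Bayes rather than SGD):

```python
import numpy as np

def factorise(ratings, rank=2, steps=5000, lr=0.02, reg=0.01, seed=0):
    """Fit latent consumer and product factors to observed (consumer, product,
    rating) triples by SGD on the squared error; unobserved cells of the
    sparse matrix are simply never visited."""
    rng = np.random.default_rng(seed)
    n_users = 1 + max(i for i, _, _ in ratings)
    n_items = 1 + max(j for _, j, _ in ratings)
    U = 0.1 * rng.normal(size=(n_users, rank))  # latent consumer preferences
    V = 0.1 * rng.normal(size=(n_items, rank))  # latent product characteristics
    for _ in range(steps):
        i, j, r = ratings[rng.integers(len(ratings))]
        err = r - U[i] @ V[j]                   # prediction error on one cell
        u_old = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])
    return U, V

# Toy data: two groups of consumers with opposite preferences over two products.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 5), (1, 1, 1),
           (2, 0, 1), (2, 1, 5), (3, 0, 1), (3, 1, 5)]
U, V = factorise(ratings)
pred = U @ V.T  # reconstructed preference matrix
```

The inner product U[i] @ V[j] plays the role of the latent utility component that Athey et al. (2018) embed in the RUM utility equation.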

Amortised variational inference
A recent application of amortised variational inference (AVI) to the estimation of the mixed logit model also offers possibilities to improve choice prediction accuracy (Rodrigues, 2020). Instead of introducing consumer-level local variational parameters for the random parameters, AVI maps observed choices and covariates to the corresponding variational parameters using a deep neural network, thereby avoiding the growth of the number of variational parameters with the sample size. AVI thus includes a generic inference network that takes a consumer's data as input and provides the approximate posterior distribution of her random taste parameters as output. In other words, AVI provides a trained inference network as a by-product of the estimation, which can be used to obtain the posterior distribution of the random taste parameters of a new consumer or of an existing consumer in a new choice situation (Rodrigues, 2020). AVI has the potential to become a workhorse method in online learning applications due to its fast estimation with stochastic backpropagation and GPU-accelerated computations. AVI performs well in the initial experiments presented in Rodrigues (2020), but its performance needs to be benchmarked against other competing methods.
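The core idea of an amortised inference network can be sketched with a tiny numpy MLP; the architecture, dimensions and the way the consumer's data are summarised are all hypothetical here and are not the network used by Rodrigues (2020):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(d_in, d_hidden, d_out):
    """Tiny MLP that maps a consumer's pooled data to the mean and
    log-variance of a Gaussian approximate posterior over her tastes."""
    return {
        "W1": 0.1 * rng.normal(size=(d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "W2": 0.1 * rng.normal(size=(d_hidden, 2 * d_out)), "b2": np.zeros(2 * d_out),
    }

def amortised_posterior(params, consumer_data, d_out):
    # One shared network serves all consumers: no per-consumer parameters.
    h = np.tanh(consumer_data @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    mu, log_var = out[:d_out], out[d_out:]
    return mu, log_var

def reparameterised_draw(mu, log_var):
    # One stochastic draw of the random taste parameters.
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

params = init_network(d_in=6, d_hidden=8, d_out=3)
summary = rng.normal(size=6)          # e.g. pooled choices and covariates
mu, log_var = amortised_posterior(params, summary, d_out=3)
beta_draw = reparameterised_draw(mu, log_var)
```

In a full implementation, the network weights are trained jointly with the model by maximising the evidence lower bound via stochastic backpropagation; applying the trained network to a new consumer's data then yields her approximate posterior in a single forward pass.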

Neural network and tree-based models
To leverage the benefits of machine learning advancements in discrete choice models without compromising on interpretability and economic theory, recent RUM-based choice models have adopted variants of neural networks (Sifringer et al., 2020, Wang et al., 2020) and regression trees (Kindo et al., 2016) to specify semi- and non-parametric utility functions. These advanced models are reported to improve the prediction accuracy of discrete choice models in validation samples, but they place limited focus on improving within-individual predictions, i.e. predicting the choice of a consumer from the training dataset in a new choice situation. Bringing this additional capability into these data- and theory-driven models could make them viable for online recommender systems.

Conclusion
In this research note, we evaluate the ability of mixed logit models with unobserved inter- and intra-individual heterogeneity to generate individual-level predictions. Using simulated and real data, we demonstrate that mixed logit with unobserved inter- and intra-individual heterogeneity does not provide significant improvements over standard mixed logit models, which only account for inter-individual taste variation. This observation persists even in scenarios characterised by high levels of intra-individual taste variation and when the number of choice situations per individual is large. Moreover, the estimation of mixed logit with unobserved inter- and intra-individual heterogeneity demands at least seven times as much computation time as the estimation of standard mixed logit. For mixed logit with unobserved inter- and intra-individual heterogeneity, we also find that the maximum simulated likelihood (MSL) estimator with analytical gradients is faster than, or not substantially slower than, the Bayesian Markov chain Monte Carlo (MCMC) method, which stands in contrast to previous studies that used MSL with numerical gradients (see Becker et al., 2018). We ascribe the inability of mixed logit with unobserved inter- and intra-individual heterogeneity to outperform standard mixed logit to the former's predominant emphasis on non-structural random heterogeneity. In light of recent advances at the intersection of machine learning and econometrics, we discuss several promising alternative modelling approaches which may offer superior prediction performance by flexibly capturing dependencies between decision-makers, alternatives and attributes.
Notwithstanding our findings, we argue that mixed logit with unobserved inter- and intra-individual heterogeneity has a place in the literature as a variance decomposition technique and as a tool for understanding behaviour, even though the model does not offer substantially better predictive accuracy than standard mixed logit models which only account for inter-individual heterogeneity. Whilst extant applications of the model focus on decomposing taste variation into inter- and intra-individual terms, future applications may benefit from reconceptualising the model as an instance of a hierarchical model (Gelman and Hill, 2006) and may examine less granular nesting structures of observational units. In particular, revealed preference data inherently provide many meaningful ways to organise observational units into nests. For example, in a longitudinal study of the career choices of high school graduates, the observational units are naturally nested within schools, districts, graduation years and so on. Our analysis also includes comparisons of the unconditional out-of-sample predictive abilities of standard logit, standard mixed logit and mixed logit with unobserved inter- and intra-individual heterogeneity. In both the simulation study and the real data application, we find that the two types of mixed logit do not offer substantial improvements in unconditional predictive accuracy over standard logit. These observations are consistent with the literature (Cherchi and Cirillo, 2010, Wang et al., 2021) and suggest that while mixed logit is useful for explaining behaviour, as it can accommodate unobserved taste heterogeneity, unrestricted substitution patterns and correlation in unobservables over time, it may not necessarily provide more accurate unconditional out-of-sample predictions than standard logit. Finally, we note that the out-of-sample prediction scenarios considered in this research note correspond to cases of internal validation (see Parady et al., 2021).
It would be instructive to assess to what extent our findings generalise to cases of external validation, i.e. out-of-sample prediction using data from a different time period or region or data gathered via a different method (see Parady et al., 2021).

A.1 Gradient of simulated log-likelihood
In what follows, we derive expressions for the gradient of the simulated log-likelihood of the mixed logit model with unobserved inter- and intra-individual heterogeneity. First, we let ϑ_i denote one of the model parameters collected in θ. We have

∂ ln P̂_n(θ) / ∂ϑ_i = [∂P̂_n(θ) / ∂ϑ_i] / P̂_n(θ).

To find the derivative in the numerator, we define

ψ_{nt,d}(θ) = (1/R) Σ_{r=1}^{R} P(y_nt | X_nt, β_{nt,dr})

with

ψ̇_{nt,d}(θ) = ∂ψ_{nt,d}(θ) / ∂ϑ_i
  = (1/R) Σ_{r=1}^{R} [ P(y_nt | X_nt, β_{nt,dr}) (1 − P(y_nt | X_nt, β_{nt,dr})) ∂V(X_{nt,y_nt}, β_{nt,dr}) / ∂ϑ_i
    − Σ_{j' ∈ C: j' ≠ y_nt} P(y_nt | X_nt, β_{nt,dr}) P(j' | X_nt, β_{nt,dr}) ∂V(X_{ntj'}, β_{nt,dr}) / ∂ϑ_i ].
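The logit-probability derivative that appears inside ψ̇ can be verified numerically. The sketch below uses a linear-in-parameters utility V_j = X_j · θ (an illustrative choice, so that ∂V_j/∂θ = X_j), evaluates both the compact form of the gradient and the pairwise form with the sum over non-chosen alternatives, and checks them against central finite differences:

```python
import numpy as np

def probs(X, theta):
    """Logit probabilities for linear-in-parameters utilities V = X @ theta."""
    v = X @ theta
    e = np.exp(v - v.max())
    return e / e.sum()

def dP_y(X, theta, y):
    """Compact gradient of the chosen alternative's probability:
    P_y * (X_y - sum_j P_j X_j)."""
    p = probs(X, theta)
    return p[y] * (X[y] - p @ X)

def dP_y_pairwise(X, theta, y):
    """Equivalent form with the sum running over non-chosen alternatives:
    P_y (1 - P_y) X_y - sum_{j != y} P_y P_j X_j."""
    p = probs(X, theta)
    g = p[y] * (1 - p[y]) * X[y]
    for j in range(len(p)):
        if j != y:
            g = g - p[y] * p[j] * X[j]
    return g

rng = np.random.default_rng(0)
X, theta, y = rng.normal(size=(3, 2)), rng.normal(size=2), 1

# Central finite-difference check of the analytical gradient.
eps, grad_fd = 1e-6, np.zeros(2)
for i in range(2):
    d = np.zeros(2)
    d[i] = eps
    grad_fd[i] = (probs(X, theta + d)[y] - probs(X, theta - d)[y]) / (2 * eps)
```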
Here, ρ is a step size that needs to be tuned. We employ the same tuning mechanism as Train (2009): ρ is initialised at 0.1; after each iteration, ρ is decreased by 0.001 if the average acceptance rate across all decision-makers is less than 0.3, and increased by 0.001 if the average acceptance rate across all decision-makers is greater than 0.3.
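The adjustment rule can be sketched as follows; the lower bound `floor` is an added safeguard (not part of the original description) to keep ρ strictly positive:

```python
def tune_step_size(rho, acceptance_rate, target=0.3, delta=0.001, floor=1e-6):
    """Adjust the random-walk step size after one iteration, following the
    tuning rule of Train (2009) described above."""
    if acceptance_rate < target:
        rho = max(rho - delta, floor)   # moves rejected too often: shrink step
    elif acceptance_rate > target:
        rho = rho + delta               # moves accepted too often: grow step
    return rho

rho = 0.1
rho = tune_step_size(rho, acceptance_rate=0.45)  # acceptance too high: rho grows
```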