Estimating GEV models with censored data☆
Introduction
Current estimation methods for discrete choice models generally assume that all alternatives are observed to have been chosen for at least some observations in the estimation dataset. The simplest estimators are derived from an assumption that the sample of observations represents a purely random selection of possible observations (i.e., the population). However, using a purely random sample is not always desirable or possible. For example, using a purely random sample of travelers may be undesirable when modeling use of low ridership modes, because a prohibitively large sample would be required to ensure that sufficient quantities of users are sampled to accurately model their preferences. Moreover, in oligopolistic markets, it may even be deemed illegal collusion for competitors to share data about customers. The latter motivates the goal of this paper: to develop estimators for discrete choice models in which one or more alternatives is never observed to have been chosen in the estimation dataset; however, information about the censored alternatives is still available. This problem can be viewed as an extreme case of non-random sampling for the estimation data.
Stratified samples, where the selection of observations in the sample is not purely random, can be roughly divided into two categories: exogenous samples, where the probability of an observation being sampled is related to some attributes of the alternatives or the decision makers but unrelated to the observed choice; and endogenous samples, where the probability of an observation being sampled is related directly to the observed choice. This second case is often called “choice-based” stratified sampling.
A number of modifications to the basic maximum likelihood estimation procedure have been proposed in the literature to accommodate choice-based stratified samples, both for situations where the market shares for the various alternatives are known and for situations where they are not. When shares are known, Manski and Lerman (1977) showed it is possible to employ weighted exogenous sample maximum likelihood (WESML), which provides consistent estimators. However, WESML can incur a substantial loss in estimator efficiency (i.e., the standard errors of the estimates are large), notably when the variance in the weights on the observations is large. It is also possible to estimate parameters with choice-based sampling by using conditional maximum likelihood (CML), proposed by Manski and McFadden (1981). The CML methodology can even be used when market shares are unknown and must be estimated alongside the other model parameters (Hsieh et al., 1985). In the case of a multinomial logit (MNL) model with a full set of alternative specific constants, when the market shares are known the CML method reduces to exogenous sample maximum likelihood (ESML) with post-hoc adjustments to the estimated constants. But when market shares and relative sample rates for the various alternatives are not known, the independence of irrelevant alternatives (IIA) property of the MNL model ensures that while consistent estimators for other parameters are available, the true market shares of the alternatives are unidentifiable.
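The weighting idea behind WESML can be sketched in a few lines. This is a minimal illustration, assuming an MNL model, known population shares, and sample shares computed from the observed choices; the function names and array layout are hypothetical and are not taken from the paper. Each observation's log-likelihood contribution is weighted by the ratio of the population share to the sample share of its chosen alternative.

```python
import numpy as np

def mnl_probabilities(V):
    """Row-wise MNL probabilities from an (N, J) array of systematic utilities."""
    V = V - V.max(axis=1, keepdims=True)  # guard against overflow in exp
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

def wesml_weights(chosen, population_shares):
    """Weight Q_i / H_i for each observation's chosen alternative i.

    Q_i is the (assumed known) population share and H_i is the share of
    alternative i among the sampled observations.
    """
    population_shares = np.asarray(population_shares, dtype=float)
    n_alts = population_shares.size
    sample_shares = np.bincount(chosen, minlength=n_alts) / chosen.size
    return population_shares[chosen] / sample_shares[chosen]

def wesml_loglike(V, chosen, population_shares):
    """Weighted log-likelihood: sum over observations of w_n * log P_n(chosen_n)."""
    P = mnl_probabilities(V)
    w = wesml_weights(chosen, population_shares)
    return np.sum(w * np.log(P[np.arange(chosen.size), chosen]))

# Illustrative data: alternative 0 is over-sampled relative to its population share.
chosen = np.array([0, 0, 0, 1])
weights = wesml_weights(chosen, [0.5, 0.5])
```

In this sketch an alternative that never appears in `chosen` simply contributes no terms to the weighted sum, because the ratio Q_i / H_i is undefined when the sample share H_i is zero.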
We examine an extreme form of choice-based sampling: instances where one or more of the alternatives is systematically excluded from the sample used to estimate parameters, i.e., the sampling probability for those alternatives is zero. We term this condition “censored” sampling, the resulting sample “censored data”, and the alternatives that have zero sampling frequency “censored alternatives”. Under more typical choice-based sampling conditions, the probability of individual decision makers being included in the sample is a function of the observed choice, and while some choices result in a smaller probability of being included than others, all decision makers have a non-zero chance of being sampled, and all possible choices are ultimately represented in the sample. With censored data, this is not the case.
Censored data, as we define it here, does not mean that no information about the censored alternatives is available or collected. It merely means that decision makers who choose a censored alternative are never sampled. Importantly, when a decision maker who selects one of the other (uncensored) alternatives is sampled, it is still possible to observe or construct the attributes of both the chosen and non-chosen alternatives, including the censored alternatives. For example, if automobile users are censored in a mode choice model, no automobile users appear in the sample, but the hypothetical travel times and costs for automobile travel can still be computed for users of other modes of travel. This is not substantially different from what would be necessary for those observations even if auto users were not censored.
This type of censored data can arise in a variety of contexts. In the transportation planning context, censored data might arise from data collection constraints, such as limited funding or an oversight in survey design. For example, a travel survey might have been conducted which ignored bicycle users, but during subsequent modeling applications policy makers might suddenly feel that bicycling is related to their policy goals and that it should be included in the models. Censoring is particularly common in revenue management contexts, where a firm in a competitive marketplace is attempting to set price and availability of products so as to maximize profit. In that case, the data can be censored because observations of purchase decisions of the firm’s own products are readily available, but observations of purchases of competitors’ products are not, and for competitive or legal reasons that information may never be available. Moreover, some potential customers may choose not to purchase any product at all; this choice is referred to as the “no purchase” alternative, or the “outside good”.
Recent work in revenue management (Talluri and van Ryzin, 2004; Vulcano et al., 2010; Vulcano et al., 2012) and transportation planning (Newman et al., 2012; Newman et al., 2013) has examined the censored data estimation problem and proposed methodologies to estimate discrete choice model parameters, including alternative specific constants and other alternative specific parameters for censored alternatives. Most of the work on parameter estimation with censored data has focused on the MNL model because of the convenient mathematical properties of this model. However, it has been shown that if the estimated choice model is MNL, the IIA property prevents the identification of alternative specific constants (or other alternative specific parameters) for censored alternatives, unless some external information (beyond the sample of choice observations) is available (Newman et al., 2012). The outside information can be (but does not necessarily need to be) known market shares for the observable and censored alternatives. It could also be an assumption of a constant arrival rate of potential customers (Talluri and van Ryzin, 2004), the known market share of just the censored alternatives (Vulcano et al., 2012), or an unknown total market size that is assumed to be stable over time (Newman et al., 2012). For certain other choice models, no outside data is required, as has been demonstrated for the nested logit (NL) model (Newman et al., 2013). Nor is outside data required for the more general models we consider in this paper. Intuitively, this is because the inclusion of covariance terms results in a system of equations that allows identification of alternative-specific parameters for censored alternatives for particular nesting structures.
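The contrast between MNL and nested logit described above can be illustrated numerically. The following is a minimal sketch, assuming a standard two-level nested logit with one logsum parameter per nest and three alternatives, of which the last is treated as censored; the function names, utilities, and nesting structures are hypothetical choices made for illustration and do not reproduce the paper's derivation. It compares the choice probabilities among the observable alternatives, conditional on an observable alternative being chosen, for two different values of the censored alternative's utility.

```python
import numpy as np

def nested_logit_probabilities(V, nests, mu):
    """Two-level nested logit probabilities.

    V     : (J,) systematic utilities
    nests : list of index lists that partition the J alternatives
    mu    : logsum parameter for each nest, each in (0, 1]
    """
    V = np.asarray(V, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Logsum (inclusive value) of each nest.
    logsums = np.array([np.log(np.sum(np.exp(V[idx] / m)))
                        for idx, m in zip(nests, mu)])
    nest_weights = np.exp(mu * logsums)
    nest_probs = nest_weights / nest_weights.sum()
    P = np.zeros_like(V)
    for idx, m, logsum, p_nest in zip(nests, mu, logsums, nest_probs):
        # P(i) = P(i | nest) * P(nest)
        P[idx] = np.exp(V[idx] / m - logsum) * p_nest
    return P

def conditional_on_observed(P, observed):
    """Probabilities among observable alternatives, given one of them is chosen."""
    return P[observed] / P[observed].sum()

observed = [0, 1]                       # alternative 2 is treated as censored
low, high = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]

# MNL as a special case: every logsum parameter equals 1.
mnl = dict(nests=[[0], [1], [2]], mu=[1.0, 1.0, 1.0])
# Nested logit: the censored alternative shares a nest with observable alternative 1.
nl = dict(nests=[[0], [1, 2]], mu=[1.0, 0.5])

mnl_low = conditional_on_observed(nested_logit_probabilities(low, **mnl), observed)
mnl_high = conditional_on_observed(nested_logit_probabilities(high, **mnl), observed)
nl_low = conditional_on_observed(nested_logit_probabilities(low, **nl), observed)
nl_high = conditional_on_observed(nested_logit_probabilities(high, **nl), observed)
```

Under the MNL settings the two conditional probability vectors coincide, so a sample containing only observable choices carries no information about the censored alternative's utility. Under the nested structure they differ. That dependence is a necessary ingredient for identification, not a demonstration that identification holds in any particular model.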
Much of the literature on parameter estimation with censored data has focused on the unique nature of the problem. But because censored data is a type of choice-based sampling, it is possible to adapt some existing choice-based sampling parameter estimation techniques to censored data. Nevertheless, care must be taken in selecting appropriate tools. As we will outline in Section 3, not all choice-based sampling methodologies will work with censored data.
In this paper, we extend our recent work on parameter estimation with censored data (Newman et al., 2013) in three important ways. First, we outline how certain existing estimation methodologies for choice-based sampling (both conditional maximum likelihood and expectation maximization) can be used successfully with censored data. Second, we identify necessary conditions for the unique identification of alternative specific constants and parameters in models that are members of the generalized extreme value (GEV) family (McFadden, 1978) of discrete choice models, including an examination of several specific cases. Lastly, we extend the methodology to consideration of multiple simultaneously censored alternatives, and identify conditions where alternative specific constants and parameters of those censored alternatives can and cannot be separately identified.
In service of these contributions, the body of this paper is structured as follows: in the next section, we review the structure of GEV models, and the resulting modeled probabilities. In Section 3 we examine two different algorithms that can be used for parameter estimation with censored data. In Section 4, we examine details of parameter identification in three distinct members of the GEV family: a single-level nested logit, a multi-level nested logit, and a cross-nested logit model. Section 5 offers a variety of simulated and real empirical examples, providing some insight into finite sample stability of parameter estimates. Finally, we offer some conclusions and thoughts on future avenues of research.
Section snippets
Generalized extreme value models
A general form of a utility maximization model for discrete choice can be created by writing the utility ascribed by a given decision maker n for each alternative i in the set of possible alternatives as $U_{in} = V_i(x_n,\beta) + \varepsilon_i(\gamma)$, with $V_i(x_n,\beta)$ as a systematic, calculable utility function for alternative i that is derived from a vector of parameters β and a vector of observable data $x_n$ for decision maker n, and $\varepsilon_i(\gamma)$ as a random unobserved error term for alternative i parameterized by a vector γ
Estimation approaches
We can set up the estimation problem by dividing the set of all possible alternatives into two mutually exclusive and collectively exhaustive subsets: the set of alternatives which are observable, and the set of alternatives which are censored and unobservable. We will also define as βU the subset of the components of the utility function parameters β that are associated exclusively with the utility of censored alternatives and do not impact observable utilities, i.e. parameters βk where
Parameter identification
When estimating parameters for an MNL model, unique values for βU cannot be found because they are not identified given the available data. This is expressed mathematically by noting that for the MNL model in (5), $\log G_i(\cdot) = 0$ for all i, and therefore with (10) we can conclude
Because by definition any parameter βk in βU does not impact the systematic utility of any observable alternative (i.e., the antecedent of (13) is true), any parameter βk in β
Application
To evaluate the performance of the proposed algorithms, we examined several simulated examples derived from two datasets: Swissmetro and the San Francisco Metropolitan Transportation Commission’s (MTC) Work Mode Choice data.
Conclusions
This paper builds on prior work with censored data from both revenue management and transportation planning contexts, and offers three notable contributions to the literature on censored data: introducing CML as an effective tool for parameter estimation, demonstrating the potential and limits of estimating a variety of GEV models, and expanding the application to multiple censored alternatives. Together, these tools offer great promise in the development of new models and approaches,
Acknowledgements
Partial support for this research was provided by a National Science Foundation Grant, SES-1130745. Data for the Virgin America and JetBlue example was collected as part of NSF Career Grant SES-0846758. We also greatly appreciate the efforts of the editors and three anonymous reviewers, whose helpful comments greatly improved this work.
☆ Presented at IATBR Toronto, July 2012; submitted to Transportation Research Part B, November 2012; revised April and July 2013.