Estimating GEV models with censored data

https://doi.org/10.1016/j.trb.2013.09.002

Highlights

  • Some, but not all, methods for choice-based samples can be applied to censored data.

  • ASCs and other parameters for censored alternatives can be identified in many cases.

  • Having an entire nest censored in a NL model does not necessarily prevent estimation.

Abstract

We examine the problem of estimating parameters for Generalized Extreme Value (GEV) models when one or more alternatives are censored in the sample data, i.e., all decision makers who choose these censored alternatives are excluded from the sample; however, information about the censored alternatives is still available. This problem is common in marketing and revenue management applications, and is essentially an extreme form of choice-based sampling. We review estimators typically used with GEV models, describe why many of these estimators cannot be used for these censored samples, and present two approaches that can be used to estimate parameters associated with censored alternatives. We detail necessary conditions for the identification of parameters associated exclusively with the utility of censored alternatives. These conditions are derived for single-level nested logit, multi-level nested logit and cross-nested logit models. One of the more surprising results shows that alternative specific constants for multiple censored alternatives that belong to the same nest can still be separately identified in nested logit models. Empirical examples based on simulated datasets demonstrate the large-sample consistency of estimators and provide insights into data requirements needed to estimate these models for finite samples.

Introduction

Current estimation methods for discrete choice models generally assume that all alternatives are observed to have been chosen for at least some observations in the estimation dataset. The simplest estimators are derived from an assumption that the sample of observations represents a purely random selection of possible observations (i.e., the population). However, using a purely random sample is not always desirable or possible. For example, using a purely random sample of travelers may be undesirable when modeling use of low ridership modes, because a prohibitively large sample would be required to ensure that sufficient quantities of users are sampled to accurately model their preferences. Moreover, in oligopolistic markets, it may even be deemed illegal collusion for competitors to share data about customers. The latter motivates the goal of this paper: to develop estimators for discrete choice models in which one or more alternatives is never observed to have been chosen in the estimation dataset; however, information about the censored alternatives is still available. This problem can be viewed as an extreme case of non-random sampling for the estimation data.

Stratified samples, in which the selection of observations is not purely random, can be roughly divided into two categories: exogenous samples, where the probability of an observation being sampled is related to some attributes of the alternatives or the decision makers but unrelated to the observed choice; and endogenous samples, where the probability of an observation being sampled is related directly to the observed choice. This second case is often called “choice-based” stratified sampling.

A number of modifications to the basic maximum likelihood estimation procedure have been proposed in the literature to accommodate choice-based stratified samples, both for situations where the market shares for the various alternatives are known, and for situations where they are not known. When shares are known, Manski and Lerman (1977) showed it is possible to employ weighted exogenous sample maximum likelihood (WESML), which provides consistent estimators. However, WESML can incur a substantial loss in estimator efficiency (i.e., the standard errors of the estimates are large), notably when the variance in the weights on the observations is large. It is also possible to estimate parameters with choice-based sampling by using conditional maximum likelihood (CML), proposed by Manski and McFadden (1981). The CML methodology can even be used when market shares are unknown and must be estimated alongside the other model parameters (Hsieh et al., 1985). In the case of a multinomial logit (MNL) model with a full set of alternative specific constants, when the market shares are known the CML method reduces to exogenous sample maximum likelihood (ESML) with post-hoc adjustments to the estimated constants. But when market shares and relative sample rates for the various alternatives are not known, the independence of irrelevant alternatives (IIA) property of the MNL model ensures that while consistent estimators for other parameters are available, the true market shares of the alternatives are unidentifiable.
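The WESML reweighting described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an MNL model, weights each observation by the ratio of the known population share to the sample share of its chosen alternative, and uses hypothetical function names and toy data.

```python
import math
from collections import Counter


def wesml_weights(chosen, population_shares):
    """One weight per observation: population share / sample share
    of the alternative that observation chose."""
    n = len(chosen)
    counts = Counter(chosen)
    return [population_shares[i] / (counts[i] / n) for i in chosen]


def mnl_log_prob(utilities, i):
    """Log MNL probability of alternative i, computed stably."""
    m = max(utilities)
    log_denom = m + math.log(sum(math.exp(v - m) for v in utilities))
    return utilities[i] - log_denom


def wesml_log_likelihood(utilities_by_obs, chosen, weights):
    """Weighted log-likelihood that WESML maximizes over the parameters."""
    return sum(
        w * mnl_log_prob(v, i)
        for v, i, w in zip(utilities_by_obs, chosen, weights)
    )


# Hypothetical toy sample: alternative 0 is over-sampled relative to
# its (assumed known) population share of 0.5.
chosen = [0, 0, 0, 1]
population_shares = {0: 0.5, 1: 0.5}
weights = wesml_weights(chosen, population_shares)
utilities_by_obs = [[0.0, 0.2]] * len(chosen)
loglik = wesml_log_likelihood(utilities_by_obs, chosen, weights)
```

Note that the weight is a ratio with the sample share in the denominator, so it is only defined for alternatives that actually appear in the sample. An alternative with a sample share of zero contributes no observations to reweight.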

We examine an extreme form of choice-based sampling: instances where one or more of the alternatives is systematically excluded from the sample used to estimate parameters, i.e., the sampling probability for those alternatives is zero. We term this condition “censored” sampling, the resulting sample “censored data”, and the alternatives that have zero sampling frequency “censored alternatives”. Under more typical choice-based sampling conditions, the probability of individual decision makers being included in the sample is a function of the observed choice, and while some choices result in a smaller probability of being included than others, all decision makers have a non-zero chance of being sampled, and all possible choices are ultimately represented in the sample. With censored data, this is not the case.

Censored data, as we define it here, does not mean that no information about the censored alternatives is available or collected. It merely means that decision makers who choose a censored alternative are never sampled. Importantly, when a decision maker who selects one of the other (uncensored) alternatives is sampled, it is still possible to observe or construct the attributes of both the chosen and non-chosen alternatives, including the censored alternatives. For example, if automobile users are censored in a mode choice model, that means that no automobile users appear in the sample, but the hypothetical travel times and costs for automobile travel can still be computed for users of other modes of travel. This is not substantially different from what would be necessary for those observations even if auto users were not censored.
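To make this definition concrete, the following sketch simulates a censored sample. Everything in it is a hypothetical assumption for illustration: three modes, a single travel-time attribute, an MNL choice rule, and a time coefficient of -0.05. Decision makers who choose the censored mode are dropped, while every retained record still carries the attributes of all three modes.

```python
import math
import random


def simulate_censored_sample(n, censored, seed=0):
    """Draw n decision makers and keep only those whose choice is not
    in the censored set. Attributes of every alternative are retained."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        # Hypothetical attribute: travel time (minutes) for each of 3 modes.
        times = [rng.uniform(10.0, 60.0) for _ in range(3)]
        utilities = [-0.05 * t for t in times]  # assumed coefficient
        mnl_weights = [math.exp(v) for v in utilities]
        choice = rng.choices(range(3), weights=mnl_weights)[0]
        if choice in censored:
            continue  # choosers of a censored alternative are never sampled
        sample.append({"times": times, "choice": choice})
    return sample


# Mode 2 plays the role of the censored alternative (e.g., automobile).
sample = simulate_censored_sample(1000, censored={2})
```

Each record in `sample` has a travel time for mode 2 even though no record has mode 2 as its choice, which is the situation the paragraph above describes.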

This type of censored data can arise in a variety of contexts. In the transportation planning context, censored data might arise from data collection constraints, such as limited funding or an oversight in survey design. For example, a travel survey might have been conducted which ignored bicycle users, but during subsequent modeling applications policy makers might suddenly feel that bicycling is related to their policy goals and that it should be included in the models. Censoring is particularly common in revenue management contexts, where a firm in a competitive marketplace is attempting to set price and availability of products so as to maximize profit. In that case, the data can be censored because observations of purchase decisions of the firm’s own products are readily available, but observations of purchases of competitors’ products are not, and for competitive or legal reasons that information may never be available. Moreover, some potential customers may choose not to purchase any product at all; this choice is referred to as the “no purchase” alternative, or the “outside good”.

Recent works in revenue management (Talluri and van Ryzin, 2004; Vulcano et al., 2010; Vulcano et al., 2012) and transportation planning (Newman et al., 2012; Newman et al., 2013) have examined the censored data estimation problem, and proposed methodologies to estimate discrete choice model parameters, including alternative specific constants and other alternative specific parameters for censored alternatives. Most of the work on parameter estimation with censored data has been focused on the MNL model because of the convenient mathematical properties of this model. However, it has been shown that if the estimated choice model is MNL, the IIA property prevents the identification of alternative specific constants (or other alternative specific parameters) for censored alternatives, unless some external information (beyond the sample of choice observations) is available (Newman et al., 2012). The outside information can be (but does not necessarily need to be) known market shares for the observable and censored alternatives. It could also be an assumption of a constant arrival rate of potential customers (Talluri and van Ryzin, 2004), the known market share of just the censored alternatives (Vulcano et al., 2012) or an unknown total market size that is assumed to be stable over time (Newman et al., 2012). For certain other choice models, no outside data is required, as has been demonstrated for the nested logit (NL) model (Newman et al., 2013). Nor is outside data required for the more general models we consider in this paper. Intuitively, this is because the inclusion of covariance terms results in a system of equations that allows identification of alternative-specific parameters for censored alternatives for particular nesting structures.
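The intuition in the preceding paragraph can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's derivation: it uses one common two-level nested logit parameterization (nest parameter mu in (0, 1], with mu = 1 in every nest collapsing to MNL), hypothetical utilities, and hypothetical function names. Alternative 2 is treated as censored and shares a nest with observed alternative 1.

```python
import math


def nested_logit_probs(V, nests, mu):
    """Two-level nested logit probabilities.
    V: utilities; nests: lists of alternative indices; mu: nest parameters."""
    probs = [0.0] * len(V)
    # Inclusive value (logsum) of each nest.
    inclusive = [
        math.log(sum(math.exp(V[j] / m) for j in nest))
        for nest, m in zip(nests, mu)
    ]
    denom = sum(math.exp(m * iv) for m, iv in zip(mu, inclusive))
    for nest, m, iv in zip(nests, mu, inclusive):
        p_nest = math.exp(m * iv) / denom
        for j in nest:
            probs[j] = p_nest * math.exp(V[j] / m - iv)
    return probs


def conditional_on_observed(probs, observed):
    """Choice probabilities conditional on the choice being observable."""
    total = sum(probs[j] for j in observed)
    return {j: probs[j] / total for j in observed}


nests = [[0], [1, 2]]
observed = [0, 1]
V_low = [0.0, 0.5, -1.0]   # low constant for censored alternative 2
V_high = [0.0, 0.5, 2.0]   # high constant for censored alternative 2

# MNL case: all nest parameters equal to 1.
mnl_low = conditional_on_observed(nested_logit_probs(V_low, nests, [1.0, 1.0]), observed)
mnl_high = conditional_on_observed(nested_logit_probs(V_high, nests, [1.0, 1.0]), observed)

# NL case: alternatives 1 and 2 are correlated (mu = 0.5).
nl_low = conditional_on_observed(nested_logit_probs(V_low, nests, [1.0, 0.5]), observed)
nl_high = conditional_on_observed(nested_logit_probs(V_high, nests, [1.0, 0.5]), observed)
```

Under MNL, `mnl_low` and `mnl_high` coincide, so the observed choices carry no information about the censored alternative's constant. Under the nested structure, `nl_low` and `nl_high` differ, which is the kind of dependence that makes identification possible for particular nesting structures.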

Much of the literature on parameter estimation with censored data has focused on the unique nature of the problem. But because censored data is a type of choice-based sampling, it is possible to adapt some existing choice-based sampling parameter estimation techniques to censored data. Nevertheless, care must be taken in selecting appropriate tools. As we will outline in Section 3, not all choice-based sampling methodologies will work with censored data.

In this paper, we extend our recent work on parameter estimation with censored data (Newman et al., 2013) in three important ways. First, we outline how certain existing estimation methodologies for choice-based sampling (both conditional maximum likelihood and expectation maximization) can be used successfully with censored data. Second, we identify necessary conditions for the unique identification of alternative specific constants and parameters in models that are members of the generalized extreme value (GEV) family (McFadden, 1978) of discrete choice models, including an examination of several specific cases. Lastly, we extend the methodology to consideration of multiple simultaneously censored alternatives, and identify conditions where alternative specific constants and parameters of those censored alternatives can and cannot be separately identified.

In service of these contributions, the body of this paper is structured as follows: in the next section, we review the structure of GEV models, and the resulting modeled probabilities. In Section 3 we examine two different algorithms that can be used for parameter estimation with censored data. In Section 4, we examine details of parameter identification in three distinct members of the GEV family: a single-level nested logit, a multi-level nested logit, and a cross-nested logit model. Section 5 offers a variety of simulated and real empirical examples, providing some insight into finite sample stability of parameter estimates. Finally, we offer some conclusions and thoughts on future avenues of research.

Section snippets

Generalized extreme value models

A general form of a utility maximization model for discrete choice can be created by writing the utility ascribed by a given decision maker n for each alternative i in the set of possible alternatives C as

$$U_i(x_n,\beta,\gamma) = V_i(x_n,\beta) + \varepsilon_i(\gamma),$$

with $V_i(x_n,\beta)$ as a systematic, calculable utility function for alternative i that is derived from a vector of parameters β and a vector of observable data $x_n$ for decision maker n, and $\varepsilon_i(\gamma)$ as a random unobserved error term for alternative i parameterized by a vector γ

Estimation approaches

We can set up the estimation problem by dividing the set of all possible alternatives C into two mutually exclusive and collectively exhaustive subsets: the set of alternatives O which are observable, and the set of alternatives U which are censored and unobservable. We will also define as βU the subset of the components of the utility function parameters β that are associated exclusively with the utility of censored alternatives and do not impact observable utilities, i.e. parameters βk whereV

Parameter identification

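When estimating parameters for an MNL model, unique values for βU cannot be found because they are not identified given the available data. This is expressed mathematically by noting that for the MNL model in (5), $\log G_i(\cdot) = 0$ for all i, and therefore with (10) we can conclude

$$\frac{\partial V_i(x,\beta)}{\partial \beta_k} = 0 \;\; \forall i \in O \quad \Rightarrow \quad \frac{\partial \log L(\beta,\gamma)}{\partial \beta_k} = 0, \quad \forall \beta_k.$$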

Because, by definition, any parameter βk in βU does not impact the systematic utility of any observable alternative (i.e., the antecedent of (13) is true), any parameter βk in β

Application

To evaluate the performance of the proposed algorithms, we examined several simulated examples derived from two datasets: Swissmetro and the San Francisco Metropolitan Transportation Commission’s (MTC) Work Mode Choice data.

Conclusions

This paper builds on prior work with censored data from both revenue management and transportation planning contexts, and offers three notable contributions to the literature on censored data: introducing CML as an effective tool for parameter estimation, demonstrating the potential and limits of estimating a variety of GEV models, and expanding the application to multiple censored alternatives. Together, these tools offer great promise in the development of new models and approaches,

Acknowledgements

Partial support for this research was provided by a National Science Foundation Grant, SES-1130745. Data for the Virgin America and JetBlue example was collected as part of NSF Career Grant SES-0846758. We also greatly appreciate the efforts of the editors and three anonymous reviewers, whose helpful comments greatly improved this work.


    Presented at IATBR Toronto, July 2012; Submitted to Transportation Research Part B, November 2012, revised April and July 2013.
