Comparing best-worst and ordered logit approaches for user satisfaction in transit services

Customers' overall satisfaction with a public transport system depends mainly on two factors: how satisfied they are with the different aspects that make up the service, and how important each of those aspects is to them. Traditionally, researchers use revealed preference surveys and ordered probit/logit models to estimate the contribution of each service attribute to overall satisfaction. This paper examines whether the traditional method can be replaced with the more cost-effective best-worst case 1 method, using a customer survey recently conducted in Santander, Spain. The results show that the satisfaction levels obtained from the two methods are remarkably similar. The relative importance of each attribute delivered by the two methods differs, with the Best-Worst approach showing results that are more intuitive and more consistent with the literature on public transport customer satisfaction. A regression method is developed to derive customer satisfaction with each service attribute from Best-Worst modelling results.


Introduction
Ordered logit and probit models have been widely used in public transport satisfaction studies (Alonso et al., 2018; Bordagaray et al., 2014; Echaniz et al., 2018). These models predict the overall quality of a transport service (i.e., overall satisfaction) based on the extent to which users are satisfied with each of the service attributes, such as travel time, spatial coverage or service frequency. Thus, satisfaction data on each of the attributes that define the service have to be collected, usually on a Likert scale on which respondents indicate their level of satisfaction with each service attribute and with the service overall. Such surveys are usually lengthy and repetitive, resulting in low response rates or loss of sample due to respondent burden. For example, in a recent on-board survey of bus users in Santander, Spain (Echaniz et al., 2018), many respondents were not able to finish the questionnaire based on the traditional rating method. Consequently, interviewers had to leave the bus with the respondent in order to complete the survey, which not only reduced the interviewers' productivity but also complicated the logistics of the survey.
An alternative approach to studying customer satisfaction and the contributions of different service attributes to it is the Best-Worst (BW) case 1 method. Rather than asking the respondent to evaluate one service attribute at a time, as in the traditional rating method, the BW method shows the respondent a set of service attributes at a time and asks them to choose the best and the worst of the attributes shown; the process is repeated until all attributes are covered. Thus, BW surveys are less time consuming and more intuitive for respondents, requiring less guidance from the interviewer. Hence, the BW survey method is a cost-efficient way to obtain the data required for satisfaction studies, and yet it has not been widely used to study customer satisfaction in public transport systems, with only a few exceptions (Beck et al., 2017; Beck and Rose, 2016). This paper investigates the possibility of replacing the traditional approach to customer satisfaction with a best-worst case 1 survey method (Louviere et al., 2015). If this potential of the BW method is verified, a more flexible and effective way to study customer satisfaction will be available, without imposing much burden on the respondent (through shorter questionnaires) or compromising the statistical robustness of the results. With respect to the latter, the pioneering work of Beck and Rose (2016) suggests that the relative importance of service attributes identified by conventional rating methods can be biased. This limitation of the traditional method must be overcome; Cao and Cao (2017), for example, found that implicit importance (attribute importance derived from modelling estimates) based on ordered logit models is presumably a better indicator for decision making than the traditional rating.
Motivated by these observations, this paper aims to compare the suitability of BW methods with that of traditional rating methods in studying customer satisfaction within the context of bus services.
The remainder of this paper is structured into five sections. The next section reviews the relevant literature. Section 3 describes the data, the survey design and the sample. Section 4 explains the models, with estimation results presented in Section 5. The paper ends with conclusions on the main findings and some directions for future study.

Literature review
Budget constraints mean that investment options to improve services should be examined to maximise the potential benefit for a given budget. One way to identify the areas with the most potential for improvement is to establish the extent to which users are satisfied with different service attributes. This can be achieved by asking users to give their level of satisfaction with each service attribute through a traditional satisfaction survey with a rating scale such as a 5-point Likert scale. However, investments should not be made based solely on a low level of satisfaction with a certain attribute, since not all aspects of the service affect users in the same way, with some service attributes being more important than others. A common solution in public transport (PT) customer satisfaction studies is to ask users how satisfied they are with the service and also how important each service attribute is to them. In doing so, the operator knows where to focus improvement efforts to obtain the biggest benefit.
Customer satisfaction in PT services has traditionally been studied using revealed preference (RP) surveys. Different interception and interview methods have been used, such as on-board interviews (de Oña et al., 2013; Eboli and Mazzulla, 2009; Eboli and Mazzulla, 2011; Echaniz et al., 2018), online surveys (Abenoza et al., 2017; Beck and Rose, 2016; Rose and Hensher, 2018) and mobile apps (Guirao et al., 2015). Based on their experience of using the PT service, users evaluate its different aspects, typically referred to as service attributes, on qualitative or quantitative scales. These evaluations define the level of satisfaction users have with the different aspects of the PT service and also with the service bundle as a whole. The attributes to be evaluated by the respondent are usually derived from previous studies, as in the case of Hensher et al. (2003), where 13 representative items of the service were defined, or Efthymiou et al. (2018), where the factors that affect satisfaction are analysed for times of economic crisis. The subsequent analysis of the data has been carried out in different ways over the years. The simplest and most popular method is descriptive analysis, in which the mean and deviation of the level of satisfaction with each attribute and with the entire service (i.e., overall satisfaction) are used to represent customer satisfaction. Thus, most studies on user satisfaction perform this basic analysis before applying a more complex and advanced method (e.g., de Oña et al., 2017; Eboli et al., 2018; Eboli and Mazzulla, 2015; Gonzalo-Orden et al., 2011; Tyrinopoulos and Antoniou, 2008).
The importance users place on different PT service attributes can be established using different methods, which can be grouped into two approaches. The first approach uses the stated importance directly (explicit importance), where customers rate each attribute on an importance scale similar to the one used for rating satisfaction. The second approach uses derived importance, in which the relative importance of each attribute is inferred by statistically analysing the relationship of individual attributes with the overall satisfaction (implicit importance) (Weinstein, 2000). Numerous studies use the first approach to obtain the importance of the attributes (Beck and Rose, 2016; Guirao et al., 2016; Rose and Hensher, 2018), although this approach lengthens the questioning, which means that surveys can become excessively long.
However, there are more efficient ways to obtain the importance level of the attributes, as is the case with Best-Worst (BW) scaling (Louviere et al., 2015). Three types of BW exercises are used in the literature: the object case (Case 1), the profile case (Case 2), and the multi-profile case (Case 3). In Case 1, the respondent is asked to select the best and worst options from a series of objects or items, such as a list of brand names, without being shown any attribute or characteristic apart from the item itself. In Case 2, the respondent is asked to select the best and worst from a list of attributes, each of which is assigned a specific level determined by some experimental design. The choice is made between the different attributes, each having its own set of levels. Case 3 is associated with classic discrete choice experiments, where the choice is made between a set of alternatives that are composed of the same attributes but with different levels of each one. BW methods have been successfully used in different transport studies. For instance, Beck et al. (2017) used a BW case 3 study to identify respondents' attitudes towards choosing electric vehicles in the presence of regular fuelled alternatives, and Mulley et al. (2014) used BW case 1 to study citizens' preferences regarding the construction of a Bus Rapid Transit (BRT) or a Light Rail Transit (LRT) service. In the context of public transport service satisfaction, Beck and Rose (2016) also applied the BW case 1 method and compared it with the traditional rating method. They concluded that the traditional way of establishing the importance of the variables is biased, in that the service attributes people were most satisfied with were associated with the highest levels of importance.
In turn, it was observed that traditional responses did not show great variability. Beck and Rose (2016) also concluded that the traditional rating did not provide enough information for decision making. The BW method, on the other hand, made it possible to better capture these variations. The correlation between satisfaction and importance was found to be much more coherent using the BW method, making it a more useful decision tool. Following Beck and Rose (2016), this study uses a BW case 1 method instead of a traditional method.
BW data have been analysed using two main techniques. On the one hand, score measures have been used, i.e. the analytical closed form (ABW) (Lipovetsky and Conklin, 2014) or normalized BW scores (NBW) (Louviere et al., 2015). On the other hand, discrete choice models have been applied, as in Beck and Rose (2016) or Marley and Pihlens (2012). The most widely used discrete choice model is the Multinomial Logit (MNL). Marley et al. (2016) present an interesting comparison between the ABW and NBW score measures, and also compare them with the results obtained from an MNL model. Their results show that the three methods (ABW, NBW and MNL) are very closely related.
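To make the score-based approach concrete, the following is a minimal sketch of a normalized best-minus-worst count, assuming the common definition in which the number of times an item is chosen as worst is subtracted from the number of times it is chosen as best, and the difference is divided by the number of times the item was shown. The function name, the data layout and the item codes are hypothetical and do not come from the survey.

```python
from collections import Counter

def normalised_bw_scores(tasks):
    """Return (best count - worst count) / appearances for each item.

    `tasks` is a list of (shown_items, best_item, worst_item) tuples.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, b, w in tasks:
        shown.update(items)   # every displayed item counts as one appearance
        best[b] += 1
        worst[w] += 1
    return {k: (best[k] - worst[k]) / shown[k] for k in shown}

# Hypothetical responses: three tasks over four made-up item codes.
example_tasks = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]
nbw = normalised_bw_scores(example_tasks)
```

Scores defined this way lie between -1 (always chosen as worst) and +1 (always chosen as best), which gives a quick descriptive ranking before any choice model is estimated.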
Regarding implicit importance, different modelling methods have been used over the years to find the most appropriate way to define the importance of the attributes that make up a service. de Oña and de Oña (2015) and dell'Olio et al. (2018) provide a comprehensive review of the methods. Recently, Allen et al. (2018b) and Allen et al. (2019) used structural equation models to study user satisfaction with Transantiago and Metro Madrid, respectively. Mouwen (2015) analysed public transport satisfaction in the Netherlands using multiple regression. Discrete choice models have also been used, especially ordered models, both Logit and Probit (Allen et al., 2018a; Bordagaray et al., 2014; dell'Olio et al., 2010; Echaniz et al., 2018). Another approach based on discrete choice modelling relies on stated preference (SP) surveys, which show the respondent a number of choice tasks and ask them to choose the alternative they most prefer. The data are then analysed using discrete choice models of some kind. For instance, Román et al. (2014) [...]
Regardless of the data analysis method, the most statistically significant factors in most of these studies are the frequency of the service, reliability and travel time and, to a lesser extent, the comfort of the buses and the smoothness of the ride. Allen et al. (2018a,b) showed that having a reliable service and a good frequency were the most influential attributes when explaining users' satisfaction with the public bus system. The perceived waiting and travel times were also found to be of great importance. In addition, Mouwen (2015) showed that on-time performance, travel speed and frequency are the most important attributes when explaining the overall quality of the service. Similarly, Román et al. (2014) showed that urban users have a greater willingness to pay for waiting time, travel time and access time. dell'Olio et al. (2011) showed that, in order to attract users, it is necessary to increase the overall quality of the system by improving the waiting time, the comfort during the trip, the sources of information and the frequencies.
Thus, it appears that the majority of studies on public transport service satisfaction arrive at similar conclusions regarding the main drivers of customer satisfaction, even when using different modelling methods and datasets. Only one study was found that compares the results of an Ordered Logit model with a conventional rating-based importance level (Cao and Cao, 2017). The main finding of that study was that the importance levels obtained with the two methods were different. To the best of the authors' knowledge, no study has analysed the relationship between conventional rating and modelling with BW data focused on public transport satisfaction.

Survey
The survey includes two parts. The first part asked respondents to report their socio-economic characteristics (e.g., gender, age, work status, income level) and level of bus usage (e.g., trip purpose and number of trips per week), as well as the availability of alternative modes for these trips. The second part asked the respondent to evaluate the importance of, and satisfaction with, different service attributes. A total of 24 service attributes, shown in Table 2, were used to define the service, based on the existing international literature and several focus groups carried out in the city of Santander. These service attributes were grouped into six sets of four attributes each. Each respondent was asked to evaluate three sets of attributes, allocated dynamically and randomly such that no attribute appeared twice for the same respondent and the total sample provides a balance of attributes assessed.
Each respondent was asked to evaluate the same set of attributes using both the traditional rating (5-point Likert scale) and the best-worst response mechanisms. In the traditional rating exercise, the respondent was asked to rate each attribute on a 5-point Likert scale (3 sets × 4 attributes/set = 12 attributes in total). In the best-worst scaling exercise, the respondent was asked to select, out of the same four attributes included in the choice task, which attributes they are most and least satisfied with (satisfaction choice), as well as which attributes are most and least important to them (importance choice). Fig. 1 summarises the data collected from one of the three choice tasks shown to each respondent. Respondents were not shown the three tasks of one question simultaneously but one after the other. At the end of the survey, all respondents were asked to rate the service as a whole, defined as Overall Satisfaction. The overall satisfaction was obtained using the same 5-point rating scale used for the attributes.
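The allocation of attribute sets described in this section can be illustrated with a short sketch. It assumes a simple rule that is not documented in the survey description: each new respondent receives the three sets shown least often so far, with ties broken at random. Because the six sets are disjoint, no attribute can appear twice for the same respondent. The function name and the seed are hypothetical.

```python
import random

N_SETS = 6           # six fixed sets of four attributes each (24 in total)
SETS_PER_PERSON = 3  # each respondent evaluates three sets (12 attributes)

def allocate_sets(n_respondents, seed=0):
    """Assign three set indices to each respondent, keeping exposure balanced."""
    rng = random.Random(seed)
    usage = [0] * N_SETS
    plan = []
    for _ in range(n_respondents):
        # Prefer the least-shown sets; break ties at random.
        order = sorted(range(N_SETS), key=lambda s: (usage[s], rng.random()))
        chosen = order[:SETS_PER_PERSON]
        rng.shuffle(chosen)  # randomise the presentation order
        for s in chosen:
            usage[s] += 1
        plan.append(chosen)
    return plan, usage

plan, usage = allocate_sets(808)  # 808 is the sample size reported in the paper
```

Under this assumed rule every pair of consecutive respondents covers all six sets, so with an even sample size each set is shown equally often.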

Sample
The surveys were run between October and November 2017 in the city of Santander. Face-to-face interviews were conducted on four bus lines operated by the municipal public company in the urban area of Santander. The total sample includes 808 completed interviews, spread across the whole day, with interviews taking place both at bus stops and on board. Table 1 shows the main characteristics of the respondents. Women are over-represented in the sample (two in three respondents). A quarter of respondents are under 25 years old, while the other age groups are more balanced. Regarding occupation, almost half (47%) of the respondents were employed and nearly a quarter (24%) were students, with the balance being retirees (17%), unemployed (8%) and housewives (5%). Half of the respondents had some other motorized alternative to make the same journey, while only 6% would be willing to make the same journey by bicycle. The trips captured in the survey show the important role of bus services in Santander for commuting purposes, i.e. travelling between home and office/school. Nearly half of the trips (46%) have the home as an origin and more than a quarter (29%) have the home as a destination. Work was the second main reason, both as an origin and as a destination. The respondents are mainly habitual users with a low frequency of use per day. Specifically, more than half (54%) of the respondents use bus services up to 15 times per week. Finally, due to the sensitivity of the question, 42% of respondents decided not to answer the question related to their income level. Among those who did answer, income levels are well mixed, with a greater proportion of people on an average salary: 20% earn between 900 and 1500 € per month and 17% between 1500 and 2500 €.

Fig. 1. Example data collected from one choice task based on Rating vs. BW scales.

Table 1
Respondents' socio-economic information.

Table 2 shows the means of user satisfaction with each of the 24 service attributes, in descending order, together with the overall satisfaction. The traditional rating is based on a 5-point Likert scale and is recoded to take values from 0 (very unsatisfied) to 4 (very satisfied) for descriptive and modelling analysis.

Rating scale results
Overall, the respondents are quite satisfied with the service, with an average rating of 2.69 out of 4. The service attribute that users are most satisfied with relates to the use of sustainable propulsion engines with hybrid vehicles (HY), being the only attribute that has the level of satisfaction exceeding 3. In total, nine attributes have a level of satisfaction greater than average, with the lowest level of satisfaction observed for noise level (NO), air conditioning (CA) and fare (PR).

Model specifications
This section describes the approaches used to model the customer satisfaction data obtained from the empirical survey. First, the Ordered Logit model is presented, together with how it is used to model the data obtained from the traditional rating responses. This is followed by the specification of two standard logit models for the data obtained from the BW responses: one model for the level of satisfaction and another for the level of importance.

Ordered Logit for traditional rating data
Ordered Logit models are based on the following specification of a latent regression:

$$q_i^* = \beta' x_i + \varepsilon_i, \qquad i = 1, \dots, N, \tag{1}$$

in which the latent continuous preference variable $q_i^*$ is only observed in discrete form $q_i$ through a censoring mechanism:

$$q_i = j \quad \text{if } \mu_{j-1} < q_i^* \le \mu_j, \qquad j = 0, 1, \dots, J. \tag{2}$$

Note that the specification in Eqs. (1) and (2) assumes that neither the parameters $\beta$ nor the thresholds $\mu$ vary across individuals. This assumption of homoscedasticity is arguably strong and can be relaxed. The vector $x_i$ is a set of $K$ covariates that are assumed to be independent of $\varepsilon_i$, and $\beta$ is a vector of $K$ parameters to be estimated, together with $J + 2$ threshold parameters $\mu_j$, using $N$ observations. The assumption on the disturbance $\varepsilon_i$ completes the model specification. The conventional assumptions are that $\varepsilon_i$ is continuous, random and follows a certain cumulative distribution function (CDF), $F(\varepsilon_i \mid x_i) = F(\varepsilon_i)$.
For this study, $q_i^*$ represents the non-observable overall satisfaction with the PT service, while $q_i$ is the observable overall satisfaction obtained from the traditional rating question asked at the end of the survey. $J$ represents the 5-point Likert scale options of the rating task shown in Fig. 1; $x_i$ are the satisfaction ratings of the service attributes assessed by respondent $i$, with values ranging between 0 ("Very Bad") and 4 ("Very Good").
The dependent variable of the model is defined as the overall satisfaction (OS), while the independent variables are the satisfaction levels of the different attributes of the service. In total, 24 independent variables or service attributes have been defined; however, conducting on-board face-to-face surveys means that a large proportion of bus users would not have enough time to provide their level of satisfaction with each of the 24 attributes. Thus, each respondent was asked to rate 12 of the 24 attributes, which generates an additional modelling challenge, but one that can be solved using imputation methods to complete the database. The method used to complete the sample is based on Multiple Imputation (Rubin, 1977, 1978), explained below. Echaniz et al. (2019) have shown that it is possible to estimate ordered satisfaction models using this method and that the results obtained are similar to those obtained with a complete database.
Multiple imputation is carried out using a procedure called Fully Conditional Specification (FCS), which uses an iterative Markov chain Monte Carlo method (van Buuren, 2007). The FCS approach is based on variable-by-variable imputation of data, specifying an estimation model for each of the variables with missing data. The FCS tries to define $P(X, C, R \mid \theta)$ by specifying a conditional density $P(X_i \mid C, X_{-i}, R, \theta)$ for each $X_i$; this density is used to impute $X_i^{mis}$ given some $C$, $X_{-i}$ and $R$. An imputation consists of a complete cycle through all $X_i$ (van Buuren, 2007). Here $X$ represents the evaluation of the attributes, $C$ the characterization variables, $\theta$ the parameters of the imputation model and $R$ an indicator that shows whether $X$ is a missing or an observed value. The imputation is made using the Gibbs sampling methodology (Casella et al., 2016; Gilks et al., 1996), assuming that the conditional density distribution exists. This methodology has been used in a large number of simulation studies (Brand, 1999; Horton et al., 2016; Raghunathan et al., 2001; Van Buuren et al., 2006) that have provided sufficient evidence that the results obtained through the FCS are generally unbiased and have adequate coverage. Once the database is completed, the models are estimated as usual.
To estimate the model, some normalizations are required. First, to keep the probabilities positive it is required that $\mu_{j+1} > \mu_j$. As the variable $q_i^*$ exists on the entire real line and the model contains a constant term $\beta_0$, it is necessary to define $\mu_0 = 0$ and $\mu_J = +\infty$. The data do not contain information about the scaling of the dependent variable $q_i^*$; therefore, the free variance parameter $\sigma_\varepsilon^2 = \mathrm{Var}(\varepsilon_i)$ cannot be estimated. The usual approach is to assume that $\sigma_\varepsilon$ is constant and depends on the distribution assumed for $\varepsilon_i$. In Logit models it is assumed that $\varepsilon_i$ follows a logistic distribution, resulting in $\mathrm{Var}(\varepsilon_i) = \pi^2/3$.
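Returning to the imputation step, the variable-by-variable cycle of the FCS can be sketched as follows. This is a deliberately simplified illustration and not the procedure used in the study: each incomplete variable is regressed on the mean of the other variables (a stand-in for a full conditional model), imputations are drawn from a normal distribution around the fitted value and rounded to the 0-4 scale, and the regression parameters are not themselves redrawn as a proper multiple imputation would require. All names and the toy data are hypothetical.

```python
import random

def _ols(x, y):
    """Slope and intercept of a least-squares line (slope 0 if x is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def fcs_impute(rows, n_cycles=5, seed=0):
    """Return a completed copy of `rows`; None marks a missing 0-4 rating."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [row[:] for row in rows]
    # Initialise each missing cell with a random observed value of that variable.
    for j in range(n_vars):
        observed = [row[j] for row in rows if row[j] is not None]
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = rng.choice(observed)
    for _ in range(n_cycles):
        for j in range(n_vars):  # visiting every X_j once is one cycle
            # Predictor: mean of the other, currently completed, variables.
            z = [(sum(row) - row[j]) / (n_vars - 1) for row in data]
            obs = [i for i in range(len(data)) if not missing[i][j]]
            slope, intercept = _ols([z[i] for i in obs], [data[i][j] for i in obs])
            resid = [data[i][j] - (intercept + slope * z[i]) for i in obs]
            sd = (sum(r * r for r in resid) / max(len(obs) - 2, 1)) ** 0.5
            for i in range(len(data)):
                if missing[i][j]:
                    draw = rng.gauss(intercept + slope * z[i], sd)
                    data[i][j] = min(4, max(0, round(draw)))
    return data

# Toy data: 30 respondents, 4 attributes, one rating hidden per respondent.
gen = random.Random(1)
ratings = [[gen.randint(0, 4) for _ in range(4)] for _ in range(30)]
for row in ratings:
    row[gen.randrange(4)] = None
completed = fcs_impute(ratings)
```

Running the function several times with different seeds would give several completed datasets, which is the sense in which the imputation is "multiple".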
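The associated probabilities are defined as:

$$\Pr(q_i = j \mid x_i) = F(\mu_j - \beta' x_i) - F(\mu_{j-1} - \beta' x_i) > 0, \qquad j = 0, 1, \dots, J, \tag{3}$$

with $\mu_{-1} = -\infty$ by convention. The model (3) is estimated using the maximum likelihood estimator, which maximises the log-likelihood function defined as follows:

$$\log L = \sum_{i=1}^{N} \sum_{j=0}^{J} m_{ij} \log\left[F(\mu_j - \beta' x_i) - F(\mu_{j-1} - \beta' x_i)\right], \tag{4}$$

where $F(\cdot)$ is the cumulative distribution function, and $m_{ij} = 1$ if $q_i = j$ and 0 otherwise.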
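As an illustration of how these quantities are computed, the sketch below evaluates ordered logit category probabilities as differences of the logistic CDF at consecutive thresholds, and sums their logarithms into a log-likelihood. The threshold values and the linear index are hypothetical numbers chosen for the example; they are not estimates from this study.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(xb, cuts):
    """P(q = j) for j = 0..J given the linear index xb = beta'x.

    `cuts` holds the finite thresholds mu_0 < mu_1 < ... < mu_{J-1};
    mu_{-1} = -inf and mu_J = +inf are handled by the 0.0 and 1.0 end points.
    """
    cdf = [0.0] + [logistic_cdf(m - xb) for m in cuts] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cdf) - 1)]

def log_likelihood(indices, outcomes, cuts):
    """Sum of log P(q_i = observed category) over respondents."""
    return sum(math.log(ordered_logit_probs(xb, cuts)[q])
               for xb, q in zip(indices, outcomes))

# Hypothetical values: mu_0 = 0 by normalisation, five rating categories (0-4).
cuts = [0.0, 1.2, 2.9, 4.8]
probs = ordered_logit_probs(xb=2.0, cuts=cuts)
ll = log_likelihood([2.0, 0.5], [3, 1], cuts)
```

A maximum likelihood routine would search over the coefficients and the free thresholds to make `ll` as large as possible, subject to the thresholds remaining in increasing order.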

Multinomial Logit for Best-Worst scaling
The literature defines three types of BW data, known as Case 1, Case 2 and Case 3, as reviewed in Section 2. Since the aim is to establish the importance of different service attributes associated with bus services without including different attribute levels, this paper adopts the BW Case 1 method.
There are a total of $K$ attributes to be chosen from in the survey. In each BW task a subset $Y$ of four attributes is shown. With the answers to the choice tasks, a vector $\beta = (\beta_1, \dots, \beta_K)$ is estimated, which contains the utility coefficient of each attribute. The probability of choosing an attribute $b \in Y$ as best is denoted as $P_B(b \mid Y)$. In the same way, the probability of choosing an attribute $w \in Y$ as worst is denoted as $P_W(w \mid Y)$. The joint probability of choosing attribute $b$ as best and attribute $w \neq b$ as worst is defined as $P_{BW}(bw \mid Y)$.

E. Echaniz, et al. Transportation Research Part A 130 (2019) 752-769

In the experiment, the respondent had to select the best option and the worst option out of a subset of service attributes. The survey instrument was programmed in such a way that the respondent could not advance if the same attribute was selected as both the best and the worst attribute. That is, the probability of the same attribute (call it $x$) being chosen as best and worst by the same respondent $i$ (suppressed from the notation for simplicity) is always 0. Mathematically, $P_{BW}(xx \mid Y) = 0$ for every $x \in Y$. Adopting a standard logit specification to describe the choice of the best and the worst attributes (i.e. assuming that the unobserved components of the utility are independently and identically distributed random variables following a Type 1 Extreme Value or Gumbel distribution), the probability $P_{BW}(bw \mid Y)$ for one BW choice task is defined as in Eq. (5), which is also called the Maxdiff model (Marley and Louviere, 2005):

$$P_{BW}(bw \mid Y) = \frac{\exp[v(b) - v(w)]}{\sum_{p, q \in Y,\; p \neq q} \exp[v(p) - v(q)]}, \tag{5}$$

where $v(\cdot)$ is the observable utility component, specified as a linear-in-parameters function of the attributes such as $v(k) = \beta_k y_k$, where $y_k$ is an indicator taking the values 0 and 1 ($y_k = 1$ when attribute $k$ is shown to respondent $i$ and 0 otherwise). In this way, the parameter estimate $\beta_k$ can be interpreted as the importance or satisfaction level (depending on which model is being analysed) of attribute $k$ relative to the reference/base attribute, which has $\beta_0 = 0$.
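A small numerical sketch may help to see how the Maxdiff probabilities behave. It assumes the usual Maxdiff form, in which the weight of each ordered (best, worst) pair is the exponential of the utility difference and the weights are normalised over all pairs with distinct members. The utilities and item codes are hypothetical.

```python
import math
from itertools import permutations

def maxdiff_probs(utilities):
    """Map each ordered (best, worst) pair, best != worst, to its probability."""
    weights = {(b, w): math.exp(utilities[b] - utilities[w])
               for b, w in permutations(utilities, 2)}
    total = sum(weights.values())
    return {pair: wt / total for pair, wt in weights.items()}

# Hypothetical utilities for one task showing four items; "D" is the base (0).
v = {"A": 1.5, "B": 1.0, "C": 0.4, "D": 0.0}
p = maxdiff_probs(v)
most_likely = max(p, key=p.get)  # the pair with the largest utility difference
```

Pairs in which the same item is both best and worst are never generated, which mirrors the restriction enforced by the survey instrument.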
The Maxdiff model assumes that the respondent chooses the best and the worst options simultaneously; however, it is possible that the respondent selects the best option first and then eliminates this attribute from the choice set before selecting the worst option. In this case, the repeated best-worst model specification in Eq. (6) may be more appropriate (Dyachenko et al., 2012):

$$P_{BW}(bw \mid Y) = \frac{\exp[v(b)]}{\sum_{p \in Y} \exp[v(p)]} \cdot \frac{\exp[-v(w)]}{\sum_{q \in Y \setminus \{b\}} \exp[-v(q)]}. \tag{6}$$

Similarly, if we assume that the respondent selects the worst option first and then the best option, a repeated worst-best model as in Eq. (7) could be used (Dyachenko et al., 2012):

$$P_{WB}(bw \mid Y) = \frac{\exp[-v(w)]}{\sum_{q \in Y} \exp[-v(q)]} \cdot \frac{\exp[v(b)]}{\sum_{p \in Y \setminus \{w\}} \exp[v(p)]}. \tag{7}$$

Eqs. (6) and (7) are alternative model specifications to Eq. (5). Empirical studies (see for example Greene, 2016; Ho and Hensher, 2017) show that these alternative model specifications are likely to produce similar results. Thus, this paper uses the Maxdiff model and acknowledges that the repeated best-worst or repeated worst-best specifications could be used as alternatives.
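The two sequential specifications can be sketched in the same spirit. The code assumes the standard sequential logit form: the first choice is a multinomial logit over all items shown, and the second is a multinomial logit over the remaining items, with utilities entering with a negative sign whenever the "worst" item is being chosen. Function names and utilities are hypothetical.

```python
import math
from itertools import permutations

def _mnl(values, chosen):
    """Multinomial logit probability of `chosen` among the keys of `values`."""
    total = sum(math.exp(x) for x in values.values())
    return math.exp(values[chosen]) / total

def best_then_worst(v, b, w):
    """Pick the best from all items, then the worst from what remains."""
    rest = {k: -x for k, x in v.items() if k != b}
    return _mnl(v, b) * _mnl(rest, w)

def worst_then_best(v, b, w):
    """Pick the worst from all items, then the best from what remains."""
    negated = {k: -x for k, x in v.items()}
    rest = {k: x for k, x in v.items() if k != w}
    return _mnl(negated, w) * _mnl(rest, b)

v = {"A": 1.5, "B": 1.0, "C": 0.4, "D": 0.0}  # hypothetical utilities
pairs = list(permutations(v, 2))
total_bw = sum(best_then_worst(v, b, w) for b, w in pairs)
total_wb = sum(worst_then_best(v, b, w) for b, w in pairs)
```

Both specifications define a proper probability distribution over the ordered pairs, so they can be estimated on the same data as the Maxdiff model and compared with it.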

Discrete choice models
Three models were estimated using Nlogit v6.0: two MNL models based on the data obtained from the BW choice exercises and one OL model. Table 3 shows the parameter estimates, with t-values in parentheses. The BW models show the order of the attributes in terms of satisfaction/importance levels. In each model, the parameter associated with the attribute that has the lowest level of satisfaction/importance is set at 0, so that all other parameters are positive, which assists parameter interpretation. Specifically, the value of a parameter identifies its position on the satisfaction/importance scale, with a larger parameter representing a more satisfactory/important attribute.
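The choice of base attribute is a normalisation rather than a substantive assumption, which the following sketch illustrates: subtracting the smallest parameter from all parameters sets the lowest one to 0 and leaves logit choice probabilities unchanged. The numbers are hypothetical, and the re-basing shown here is an after-the-fact equivalent of choosing the base attribute before estimation; it is not a description of the Nlogit workflow.

```python
import math

def rebase(params):
    """Shift all parameters so that the smallest one equals 0."""
    low = min(params.values())
    return {k: v - low for k, v in params.items()}

def mnl_probs(params):
    total = sum(math.exp(v) for v in params.values())
    return {k: math.exp(v) / total for k, v in params.items()}

raw = {"A": -0.3, "B": 0.9, "C": 0.2}  # hypothetical estimates
shifted = rebase(raw)
same = all(abs(mnl_probs(raw)[k] - mnl_probs(shifted)[k]) < 1e-12 for k in raw)
```

Only differences between parameters are identified, so the ordering and the gaps between attributes are what carry meaning.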
The remaining column shows the parameter values of the OL model. The OL model is completed with the constant term and the threshold parameters, as explained in Section 4. The dependent variable for this model is the overall satisfaction (OS) and the independent variables are all the attributes evaluated using the conventional rating scale. The parameter values in the OL model show how much each attribute contributes to explaining customers' overall satisfaction with the service (i.e. OS).
For brevity, only estimation results from the standard logit and ordered logit models are analysed in this article. These models assume homogeneity in preferences among individuals by estimating average parameters for the sample. This homogeneity assumption is relaxed by estimating random parameter (RP) models (estimation results are shown in Appendix A). The comparison of alternative modelling approaches is conducted using the standard models, since the main findings, discussed below, still stand in light of the RP modelling results.
According to the BW satisfaction model, users are very satisfied with the company's environmental policy of deploying hybrid buses (HY). In addition, users are also very satisfied with the access times to the stops (AT) and the egress times (DT), which means that bus users in Santander perceive good spatial coverage. Among the less satisfactory attributes, the price of tickets (PR) is the least satisfactory. The current ticket fares vary depending on user type and the payment system used; however, the price is not higher than that of other public transport services nearby. The result may suggest the existence of strategic voting behaviour, in the sense that respondents strategically voted down their satisfaction with fares to reduce the chance that operators increase fares in the future. This behaviour was also observed in previous studies conducted in the same city (Echaniz et al., 2018). Environmental characteristics such as noise (NO) and air conditioning (CA) also show low levels of satisfaction, as do several attributes related to comfort during the trip: driving style (DS) and crowdedness (OC). Regarding the information channels of the service, users are very satisfied with the information offered at stops, and somewhat less so with the information available in mobile applications, whereas they are not satisfied with the remaining information sources (information on board the buses and on the website).
The BW-based importance model shows that the most important attributes are those directly related to the basic characteristics of the service, such as service reliability (SR), frequency (SE), on-board travel time (TT) and coverage of the lines (LC). Conversely, the least important attributes are noise level (NO), air conditioning systems (CA), information on board (IB) and on the website (IW).
According to the OL model, the attributes with the highest parameter values are service frequency (SE), service reliability (SR) and coverage of the lines (LC). By contrast, driver kindness (DK), noise level (NO) and the use of hybrid technologies (HY) have the lowest parameter estimates. The threshold parameters show a nonlinearity across rating points, which means that, from the user's perspective, different levels of effort are required to improve the service by one satisfaction point, such as from very bad to bad vs. from good to very good. These results are consistent with the findings of the previous study carried out in the same city (Echaniz et al., 2018).
With the data available, we wanted to analyse whether there is any connection between the models derived from the Best-Worst exercise and the results obtained from the Ordered Logit modelling and satisfaction ratings. Table 4 shows the correlation between the parameters obtained from the two approaches. The correlation between the average traditional satisfaction rating for each attribute and the BW satisfaction model is nearly perfect, with a correlation coefficient of 0.95. The Ordered Logit model shows a considerable correlation (r = 0.486) with the BW importance model.
A deeper investigation into these correlations is presented in Figs. 2 and 3. Parameter values differ in scale from one model to another, so the raw parameter estimates cannot be compared directly. Therefore, both sets of parameter estimates were standardized, yielding the positive and negative standardized scores shown in Figs. 2 and 3. The first comparison is between the two measures of satisfaction. On the one hand, there is the classic revealed preference survey rating, from which the average satisfaction is obtained with values between 0 and 4. On the other hand, there is the MNL model based on the BW satisfaction data. As can be seen in Fig. 2, the correlation between these two measures is high. Most of the attributes show a similar tendency for both measures, although there are a few exceptions such as information in mobile apps (IM), line coverage (LC), information on board (IB) and quality of the stops (ST). Both methods therefore lead to essentially the same results, lending support to the hypothesis that the BW method can replace the traditional satisfaction rating.

For the comparison of importance, the Ordered Logit model and the MNL model based on the BW importance data were selected. The value of a parameter in an ordered model can be associated with its weight in explaining the dependent variable. In other words, the parameter expresses the contribution of each attribute to the overall satisfaction. An analysis of these two sets of parameters, presented in Fig. 3, shows a lower level of correlation than in the satisfaction comparison. Although the most important attributes show a similar trend in both models, the remaining attributes show very little correlation. Consequently, although both models represent a certain level of importance of the variables, these values are not the same and therefore represent different concepts of importance.
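The standardization and correlation steps described above can be sketched as follows. This is a minimal illustration with hypothetical values for five attributes, not the estimates behind Table 4 or Figs. 2 and 3. It also checks a known property: Pearson's r is unchanged by a positive linear rescaling of either variable, so standardization mainly serves to place both sets of values on a common scale for plotting.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical values for five attributes (NOT the paper's estimates).
mean_rating = [3.1, 2.4, 1.2, 2.9, 1.8]     # average ratings on the 0-4 scale
bw_parameter = [0.9, 0.1, -1.3, 0.6, -0.4]  # BW satisfaction MNL parameters

# Standardized scores, comparable across the two models.
z_rating = standardize(mean_rating)
z_bw = standardize(bw_parameter)

# Correlation on the original values and after rescaling one variable.
r = pearson(mean_rating, bw_parameter)
r_rescaled = pearson([10 * v + 5 for v in mean_rating], bw_parameter)
```

Attributes whose two standardized scores have opposite signs, or differ widely, are the ones that would stand out as exceptions in a plot of this kind.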

Importance-performance analysis
The importance-performance analysis (IPA) (Martilla and James, 1977) is a widely used decision tool. The basis of this method is to plot both results (performance level and importance) on the same graph. Four quadrants are defined, each with a different level of improvement priority. The four quadrants are typically identified as 'keep up the good work' (Q1, "Important and satisfied"), 'possible overkill' (Q2, "Not important and satisfied"), 'low priority' (Q3, "Not important and not satisfied") and 'concentrate here' (Q4, "Important and not satisfied") (Sever, 2015). The attributes in Q1 are considered the strengths of the service: they are performing well, and investment in them should be kept at the current level to maintain satisfaction. Q2 contains attributes that are not important for users but are still performing strongly, which means that resources may be wasted on them. Attributes in Q3 have the lowest priority for investment. Finally, Q4 shows the main improvement priorities of the service: attributes that are important for users but are not performing well enough to satisfy the customers.
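The quadrant assignment can be expressed as a short sketch. It assumes that each axis is split at the mean of the plotted values, which is one common convention and is not necessarily the cut-off used in this paper. The scores below are hypothetical, and the function and variable names are illustrative.

```python
from statistics import mean

# Keys are (is_important, is_satisfied).
QUADRANTS = {
    (True, True): "Q1 keep up the good work",
    (False, True): "Q2 possible overkill",
    (False, False): "Q3 low priority",
    (True, False): "Q4 concentrate here",
}


def ipa_quadrants(importance, satisfaction):
    """Assign each attribute to a quadrant, splitting both axes at their means."""
    imp_cut = mean(importance.values())
    sat_cut = mean(satisfaction.values())
    return {
        name: QUADRANTS[(importance[name] > imp_cut, satisfaction[name] > sat_cut)]
        for name in importance
    }


# Hypothetical standardized scores for four attributes (NOT the paper's values).
importance = {"SR": 1.2, "HY": -0.8, "NO": -1.1, "PR": 0.7}
satisfaction = {"SR": 0.5, "HY": 1.4, "NO": -0.9, "PR": -1.0}

result = ipa_quadrants(importance, satisfaction)
```

With these illustrative inputs, an attribute that is important but poorly rated falls in Q4, which is where the method directs improvement effort.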
The IPA method was recently used in the literature to compare explicit and implicit importance in transit services (Cao and Cao, 2017). Explicit importance is obtained by asking the respondent, either through a conventional rating method or some other method, while implicit importance is derived from an OL model. They found that the priorities for service improvement based on explicit importance are different from those based on implicit importance. Aiming to verify this important finding, Fig. 4 presents an Importance-Performance Analysis (IPA) in which the explicit importance and satisfaction obtained from the BW model are compared with those obtained from the conventional method (rating and OL). Satisfaction levels are represented on the horizontal axis, while importance levels are shown on the vertical axis. Blue dots show the positions of the service attributes according to the BW data and modelling results. Orange dots show the same but using the conventional rating as satisfaction and the OL parameters as importance. The values are normalized for presentation purposes. Close to 55% of the service attributes fall in a different quadrant when using the conventional satisfaction rating rather than the best-worst rating. For some attributes, such as line coverage (LC) and information on mobile phones (IM), the quadrants are different but the positions are very close to each other. For other attributes, such as ticket fare (PR), the differences are quite remarkable: the BW method identifies ticket fare as an important variable, while the conventional OL model suggests the opposite (PR is not important).

Fig. 3. Comparison between BW (importance) and Ordered Logit models.
E. Echaniz, et al. Transportation Research Part A 130 (2019) 752-769
The differences are much greater in importance than in satisfaction. Satisfaction values are quite similar between the two methods, as shown by the positions relative to the horizontal axis: most attributes have very close horizontal-axis values, which means a similar satisfaction level. Differences in quadrant position are therefore a result of the different importance levels identified by the alternative methods. Given that attribute importance derived from the BW model is explicit (since users explicitly choose the most and least important attributes from a set of attributes) and that of the OL model is implicit (derived from the parameter estimates of an overall satisfaction model), we support Beck and Rose's (2016) argument that BW scaling is better than conventional rating for defining attribute importance.
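The share of attributes that change quadrant between two methods can be computed with a few lines. The quadrant assignments below are hypothetical and only loosely mirror the fare and frequency examples discussed in the text; they are not the positions plotted in Fig. 4.

```python
def share_changed(quadrants_a, quadrants_b):
    """Return the fraction and the names of shared attributes whose quadrant differs."""
    common = sorted(quadrants_a.keys() & quadrants_b.keys())
    changed = [name for name in common if quadrants_a[name] != quadrants_b[name]]
    return len(changed) / len(common), changed


# Hypothetical quadrant assignments for four attributes under each method.
bw_quadrants = {"PR": "Q4", "SE": "Q4", "HY": "Q2", "NO": "Q3"}
ol_quadrants = {"PR": "Q3", "SE": "Q3", "HY": "Q2", "NO": "Q3"}

share, changed = share_changed(bw_quadrants, ol_quadrants)
```

Because satisfaction is similar across methods, a change of quadrant in such a comparison would typically be a move along the importance axis, for example between Q4 and Q3.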
According to the results of the BW method, a change to the current fare structure may result in a higher level of satisfaction, since PR is highly important but is the least satisfactory attribute. However, the results of the ordered model suggest that a change in the fare policy would not increase the overall satisfaction with the service. Such conflicting evidence also applies to service frequency: the BW method suggests that improving service frequency would improve customer satisfaction, while the OL method identifies service frequency as a low priority. These differences suggest that the concept of importance may differ between the two modelling methods. We further investigate this in the next section.

Fig. 4. Importance-performance analysis.

Deriving attribute-specific satisfaction from BW models
An outstanding question regarding the use of the BW survey method in customer satisfaction studies is whether it is possible to obtain the customer satisfaction level with each service attribute from BW survey data. This section addresses this question by performing a regression analysis to estimate the average level of satisfaction with each attribute based on the overall level of satisfaction and the BW model parameters. Two models are estimated: one for the satisfaction level and one for the importance level of each attribute.
The BW satisfaction model shown in Section 5.1 represents the relative satisfaction level of the many attributes that together define the entire service. Thus, these parameters are best interpreted as how much more or less satisfied an average customer is with a certain service attribute, compared to the reference attribute. The challenge is to convert this relative satisfaction level into an average level of satisfaction for each attribute, which is usually available in the traditional rating survey through specific questions. We propose a way to supply this information for the BW method by adding a constant term to the regression model, effectively converting the BW model parameters into the satisfaction level for each of the attributes included in the BW survey. More specifically, it is the Overall Satisfaction that acts as the constant term, as it remains constant for the whole sample. The regression model is specified in equation (8), where the dependent variable is the satisfaction rating for each attribute and the independent variables are the Overall Satisfaction (OS) and the BW satisfaction model parameters.

$$\mathrm{SatisfactionRating}_k = \theta_{OS}\,\mathrm{OverallSatisfaction} + \theta_{BW}\,\mathrm{BW(Satisfaction)}_k + e_k \qquad (8)$$

where $k$ indexes the service attributes, $\theta_{OS}$ and $\theta_{BW}$ are the coefficients to be estimated and $e_k$ is the error term.

Table 5 shows the estimation results, confirming that it is possible to accurately estimate the average satisfaction level for each of the attributes from the BW data. The model has an R² of 0.999, indicating that nearly 100% of the variation in attribute-specific satisfaction is explained by the OS level and the BW model parameters (i.e., the relative satisfaction with different attributes). The results suggest that conventional rating surveys can be replaced by BW scaling surveys for measuring attribute satisfaction.
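A regression of this kind can be sketched with ordinary least squares on two regressors, the Overall Satisfaction (a constant column) and the BW satisfaction parameter, without a separate intercept. The linear form, the data and the function name below are assumptions made for illustration; they are not the specification or the estimates reported in Table 5.

```python
from statistics import mean


def fit_two_regressors(y, x1, x2):
    """Least squares for y = a*x1 + b*x2 (no intercept), via the 2x2 normal equations."""
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    a = (s1y * s22 - s2y * s12) / det
    b = (s2y * s11 - s1y * s12) / det
    return a, b


# Hypothetical data for six attributes (NOT the paper's values).
overall_satisfaction = 2.6                 # same value for every attribute
bw = [0.9, 0.1, -1.3, 0.6, -0.4, 0.0]      # BW satisfaction parameters
ratings = [3.1, 2.6, 1.7, 2.9, 2.3, 2.5]   # average attribute ratings (0-4)

os_column = [overall_satisfaction] * len(bw)
a, b = fit_two_regressors(ratings, os_column, bw)

# Fitted ratings, residuals and the usual (centered) R-squared.
fitted = [a * overall_satisfaction + b * v for v in bw]
residuals = [t - f for t, f in zip(ratings, fitted)]
ss_res = sum(e * e for e in residuals)
ss_tot = sum((t - mean(ratings)) ** 2 for t in ratings)
r_squared = 1.0 - ss_res / ss_tot
```

Because the Overall Satisfaction column is constant, the product of its coefficient and the OS value plays the role of an intercept, which anchors the relative BW parameters to the absolute rating scale.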
The parameters of an ordered model define, to some extent, the implicit importance of the different attributes. As shown in Section 5.1, the implicit importance from the OL model and the explicit importance derived from the BW MNL model are correlated, although the correlation is not as strong as in the case of satisfaction. Table 6 shows the results of the regression model defined in Eq. (9).
The goodness-of-fit indicators show that the parameters of the OL model can be estimated from BW data only to a certain extent. This means that there is a difference in how importance is captured by each method. In the BW method importance is explicit, while in the OL model it is implicit; thus, the two are related but not the same. The parameters of an OL model encapsulate both the explicit importance and the satisfaction. In other words, the parameters of the OL model represent not only the importance that each attribute has as a separate factor within the system, but also the satisfaction with the entire system. This explains the differences in the IPA shown in Section 5.2, as the OL model goes further than simply defining a level of importance for the different attributes.

Conclusions
This study showed that the conventional method (rating and ordered logit models) and the BW method are both suitable for analysing users' satisfaction with public transport services. The correlation and modelling results indicate that the conventional rating and BW methods are equivalent when studying the satisfaction levels of the different attributes of the service. The regression model shows that it is possible to reproduce the rating results by using BW data, and thus the BW method can replace the conventional rating method. This is an important finding, with significant implications for improving the efficiency of customer satisfaction surveys and reducing the burden on respondents. Since the BW survey method is less time consuming and easier for the respondent to answer than the traditional rating method, this finding suggests that the lengthy and repetitive customer survey can be replaced with a series of games presented as best-worst choice tasks, particularly if the results reported in this paper can be replicated on different datasets and/or in different settings.
The roles that different service attributes play in explaining overall customer satisfaction turn out to be very different depending on the modelling method. This finding is consistent with the growing evidence on the difference between explicit and implicit importance in customer satisfaction studies. Specifically, the important drivers of customer satisfaction obtained from the best-worst explicit importance levels are travel time, service frequency and price, while these key service attributes did not come out as important factors based on the OL implicit importance levels. In this sense, BW scaling appears to be a good indicator of attribute importance, while the OL model parameters go further than just defining an importance level. The regression model has shown that OL model parameters are influenced not only by the importance of the attributes but also by their satisfaction level.
The Importance-Performance Analysis (IPA) offers a way to classify service attributes according to their importance and satisfaction levels. This helps transport operators and authorities to identify the key service attributes to improve. Again, the conclusions can be considerably different, and sometimes opposite, depending on the adopted method. The main differences between the two methods concern ticket fare (PR) and service frequency (SE), and they arise in the importance levels, as the satisfaction results are very similar between both methods. The importance levels derived from BW data are in line with the literature. In conclusion, the explicit importance levels obtained by using the BW method are more accurate than the implicit importance derived from the OL model. Therefore, the IPA based on BW scaling is a better indicator of which attributes should be the real priorities for operators. This finding verifies the results of Cao and Cao (2017), who concluded that improvement priorities based on implicit importance (OL model) were more reliable than those based on the explicit conventional rating.
Although the results of this study show that BW scaling can effectively replace the conventional rating method, several considerations should be taken into account when implementing the BW scaling method for customer satisfaction surveys. First, to obtain the average satisfaction rating of the attributes from BW data, it is necessary to fit the regression model that connects both models. A preliminary study, similar to the one carried out in this article, is therefore required to estimate the regression model. In addition, the Overall Satisfaction with the service must be rated regardless of the method used.
Regarding future research, the results reported in this paper suggest that correlations between the means of random parameter estimates are as strong as correlations between non-random model parameters; however, there are differences in the deviations of the random parameters, in the sense that less preference heterogeneity was found in the BW data than in the rating data. This might suggest that the BW survey method results in less noise in the data than the traditional rating survey method does. More research is required to verify this initial finding, for example using different datasets and/or parsimonious models in which preference heterogeneity is segmented into systematic and random heterogeneity. Different modelling approaches should also be tested to fit the BW data, for example the repeated best-worst model. In addition, in ongoing research we are investigating the extent to which customer satisfaction varies across service levels during the whole day. Automatic Vehicle Location data is being used to identify not only the line but also the bus the respondents were on.