A weighted travel time index based on data from Uber Movement

In this paper, we combine data from Uber Movement and from a representative household travel survey to construct a weighted travel time index for the Metropolitan Region of São Paulo. The index is calculated based on the average travel time of Uber trips taken between each pair of traffic zones in each hour from January 1, 2016 to December 31, 2018. The index is weighted based on trips reported in a household travel survey that was designed to be statistically representative of all trips made in the city during a typical business day. We show that the index is strongly correlated with traditional measures of congestion while covering a broader portion of the road network. Finally, we use the index to run a multivariate ex-post analysis that estimates the effect of different events on traffic congestion in the city, including holidays, public transit strikes, road shutdowns, rain and major sport events.


Introduction
Traffic congestion is a major component of transport system efficiency. With higher levels of congestion, more time is spent in traffic and less time is available for productive activities and for leisure [1]. Therefore, congestion represents a major source of costs and inefficiency for urban economies. Because of that, measuring congestion is a key element for monitoring and evaluating transport systems and for supporting policy decisions aimed at improving them. Traditional methods for measuring congestion, such as loop detectors, are important tools for road segment monitoring and for traffic signal control. However, their coverage is limited to the locations where they are placed, so they are not ideal tools for tracking congestion with highly granular temporospatial coverage [2]. Measuring congestion at the personal trip level requires other types of tools, such as probe vehicles or taxis. However, such methods are costly and are not commonly available in developing country cities, where traffic agencies face stricter financial constraints [3].
With technological improvements and the ubiquitous spread of cellphones with geographical tracking and transportation applications, other types of in-vehicle data are now broadly available. This creates the opportunity to develop alternative methods for measuring characteristics of transport systems with detailed coverage in both time and space, at costs that represent a fraction of those required to install and maintain traditional monitoring structures. One example of this new type of data is the information compiled by the Uber Movement website, which provides the average travel time of trips made by vehicles from Uber, a leading e-hailing company. This dataset is available for several major cities throughout the world, and it includes the average travel time of trips by hour and by pair of origin and destination neighborhoods.
In this paper, we explore data from the Uber Movement (UM) website and combine it with a representative household travel survey for the Metropolitan Region of São Paulo (MRSP) to create a virtually cost-free traditional Travel Time Index (TTI) that estimates trip delays due to congestion experienced by residents of São Paulo in every hour throughout the last three years and in almost all neighborhoods of the city. We compare this index with a traditional congestion measure calculated by the government of São Paulo, and we show that while there is a strong correlation between the measurements, our index covers a broader set of roads and is more easily translated into actual travel time losses. Finally, we use the TTI to estimate a multivariate model that evaluates the association between different types of events and congestion, identifying the direction and magnitude of those associations.
The remainder of this paper is organized as follows: Sect. 2 describes the UM data and the household travel survey used to construct the index. In Sect. 3 we explain the index and present some of its descriptive characteristics. Section 4 shows the multivariate analyses and Sect. 5 concludes.

Uber Movement
The UM project provides data about travel times of trips made by Uber vehicles in selected cities throughout the world. Cities are divided into neighborhoods according to official traffic zones or administrative boundaries, and the dataset available on the website includes the average travel time of Uber trips made between city neighborhoods in a given time interval. That is, for a neighborhood of origin o and a neighborhood of destination d, if there was a minimum number of Uber trips between those neighborhoods during time period p, then the website includes the average travel time $t_{odp}$ of these trips.
In the case of São Paulo, the metropolitan area is divided into the 517 traffic zones (TZs) defined by the 2017 Origin-Destination Household Travel Survey (OD17) carried out by the city's Subway Company (described in further detail later in this Section), and average travel times are available from January 1, 2016 to the most recent finished quarter. While different levels of data aggregation can be extracted directly from the UM website, we requested from the project's maintainers all the hourly data for all neighborhood pairs of the MRSP on all dates from January 1, 2016 to December 31, 2018. This dataset includes the average travel time observed in 370 million combinations of origin, destination, date and hour. It covers almost all dates and all hours from January 1, 2016 to December 31, 2018 (1094 dates and 26,256 hours) and includes 98,063 unique TZ pairs.

The 2017 São Paulo household travel survey
The UM data is restricted to travel times. However, our goal is to create a congestion measure that accounts for the travel patterns of residents. For example, in the early hours of the day, there is a larger flow of travelers going from residential neighborhoods to the central business district. Therefore, higher levels of congestion on those routes in the morning affect more individuals than an equivalent level of congestion on those same routes during the evening.
To account for these differences in flows, we use data from the 2017 Origin-Destination Household Travel Survey (OD17) carried out by the city's Subway Company. The survey interviewed 31,487 households in the MRSP between June 2017 and October 2018, and collected information about 157,992 trips made by 86,318 individuals on the business day immediately before each interview [4]. The survey was designed to be statistically representative of the population of the whole metropolitan region, so each observation included in the survey is associated with a survey weight that can be used to extrapolate results to the overall population of São Paulo.
Besides sample weights, the information from the survey used in this study includes the TZs of origin and destination of each representative trip (the TZs from the survey are the same as the TZs used by Uber Movement), the transport mode that was used and the departure time. The survey also includes information about travelers' demographics and trip motivation [4].

The weighted travel time index
A congestion measure should support quantitative assessments by transportation planners and provide information to the general public and policy makers. Desirable characteristics of such measures include: ease of communication, applicability to different geographical scales, comparability to a certain standard, measurement on a continuous scale, a basis in travel time data and the ability to describe very congested conditions [5]. One type of metric that satisfies all these conditions and that has been commonly used both in academic works and in other types of publications is the travel time index (TTI), which is calculated by comparing the travel time of a given trip with the expected duration of that same trip under a certain baseline, usually free-flow conditions. Formally, the index is calculated by the formula:

$$tti_{odp} = \frac{t_{odp}}{t^{*}_{od}} - 1 \qquad (1)$$

where the term $tti_{odp}$ represents the TTI of trips made between a point of origin o and a point of destination d during period p. On the right side of the equation, the term $t_{odp}$ represents the travel time observed for this route in period p, while $t^{*}_{od}$ indicates the travel time of that same route in free-flow conditions. For example, suppose that a certain trip takes on average 32 minutes during the morning peak on business days, and that this same trip takes 20 minutes in free-flow conditions. In this case, the TTI would be calculated as 32/20 - 1 = 0.6. That is, the TTI indicates that during the morning peak, this route has an average level of congestion of 60%.
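The definition and worked example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `travel_time_index` and its argument names are assumptions introduced here, not part of the original study.

```python
def travel_time_index(observed_minutes: float, free_flow_minutes: float) -> float:
    """Return the TTI: the relative delay of an observed trip over free flow."""
    if free_flow_minutes <= 0:
        raise ValueError("free-flow travel time must be positive")
    return observed_minutes / free_flow_minutes - 1.0


# Worked example from the text: 32 minutes observed, 20 minutes in free flow.
morning_peak_tti = travel_time_index(32.0, 20.0)
print(f"{morning_peak_tti:.0%}")  # prints 60%
```

A value of zero means the trip took exactly the free-flow time; negative values are possible whenever an observed average falls below the chosen baseline.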
A key advantage of this metric is that it has a simple and intuitive interpretation. Another important characteristic is that the indicator can be aggregated both in time and in space, allowing the analysis and comparison of different regions and different periods. In addition, the index aggregation may be weighted to consider the different travel volumes between regions at different points in time. The aggregation of the TTI can be described by the following formula:

$$TTI_{RP} = \sum_{o,d \in R} \sum_{p \in P} \frac{v_{odp}}{V_{RP}} \left( \frac{t_{odp}}{t^{*}_{od}} - 1 \right) \qquad (2)$$

where $TTI_{RP}$ is the aggregated TTI for region R over a time period P. In addition, $v_{odp}$ is the number of trips observed between the traffic zones o and d that make up region R, in each of the periods p that compose P, and $V_{RP}$ is the total number of trips between all traffic zone pairs within region R over all the periods that make up the aggregated period P. For example, if o and d are traffic zones and p are hours of the day, P can be defined as a certain date and R as the Metropolitan Region as a whole. The result of this aggregation can be directly interpreted as the average level of congestion among all trips made in region R throughout the whole period P.
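The aggregation described above is a trip-weighted mean of route-level TTIs, which can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation: the function name `aggregate_tti`, the record layout and the numbers in `example_records` are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Each record holds (trips v_odp, observed time t_odp, free-flow time t*_od).
Record = Tuple[float, float, float]


def aggregate_tti(records: Iterable[Record]) -> float:
    """Trip-weighted mean of route-level TTIs over a region and period."""
    rows = list(records)
    total_trips = sum(v for v, _, _ in rows)  # plays the role of V_RP
    if total_trips <= 0:
        raise ValueError("total number of trips must be positive")
    weighted = sum(v * (t / t_free - 1.0) for v, t, t_free in rows)
    return weighted / total_trips


# Hypothetical values for two routes observed in the same hour.
example_records = [
    (300.0, 32.0, 20.0),  # busy route, route-level TTI of 0.6
    (100.0, 22.0, 20.0),  # quieter route, route-level TTI of 0.1
]
regional_tti = aggregate_tti(example_records)
print(f"{regional_tti:.1%}")
```

Because the busy route carries three quarters of the trips, the aggregate sits much closer to its TTI than to that of the quieter route, which is the behaviour the weighting is meant to capture.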

Adjustments for calculating the weighted TTI using UM data and the OD17
From the elements of Equation 2, UM data includes the average travel time by hour for each pair of TZs in the MRSP ($t_{odp}$). So, to calculate the weighted TTI, we still need: 1) the free-flow time between each pair of TZs ($t^{*}_{od}$); 2) the number of trips made between each TZ pair in each period ($v_{odp}$) and the corresponding aggregation ($V_{RP}$). Next, we describe how each of these elements is estimated in our calculation of the index.

Free-flow travel time ($t^{*}_{od}$)
The free-flow travel time is the time that a trip would take if roads were completely free of other vehicles and of other factors causing slowness; it is thus a theoretical measure. While simulation methods could be used to estimate this metric for the TZ pairs included in our study, such models would require detailed information about the road network, such as speed limits, and the distribution of the exact origin and destination coordinates of trips, neither of which is easily available. Therefore, to overcome these limitations, we estimate free-flow travel time using the UM data itself and the assumption that there are periods of the day when observed travel times approximate free-flow conditions, most often during late night hours.
Based on these assumptions, we define the free-flow travel time for each TZ pair as the second lowest average travel time value per hour during the most recent quarter of data from UM. The second lowest value is used instead of the lowest to avoid selecting occasional outlier observations from pairs with fewer Uber trips. The most recent quarter is used because, as Uber expands its services, the density of data for each TZ pair also increases.
To illustrate this approach, Fig. 1 shows the average travel time by hour in a selected TZ pair in the MRSP during the last quarter of 2018. As expected, average travel times by hour are lower during late night hours and increase during peak periods. Given the procedure described above, the average travel time at 4 am is selected as the free-flow approximation for this TZ pair.
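The selection rule described above can be sketched as follows. The function name `free_flow_time`, the input layout (one average per hour of the day for a single TZ pair) and the numbers are assumptions made for illustration; they are not taken from the UM data.

```python
from typing import Mapping


def free_flow_time(hourly_average_minutes: Mapping[int, float]) -> float:
    """Approximate free-flow time as the second lowest hourly average.

    `hourly_average_minutes` maps an hour of the day (0-23) to the average
    travel time observed for one TZ pair in that hour.
    """
    ordered = sorted(hourly_average_minutes.values())
    if len(ordered) < 2:
        raise ValueError("at least two hourly averages are required")
    return ordered[1]


# Hypothetical hourly averages (minutes) for a single TZ pair.
# The 3 am value is unusually low and stands in for a sparse-data outlier.
hourly = {3: 9.0, 4: 12.5, 5: 13.0, 8: 30.0, 13: 21.0, 18: 34.0}
baseline = free_flow_time(hourly)  # picks the 4 am value, 12.5
```

Skipping the single lowest value trades a slightly higher baseline for robustness: one hour with very few trips cannot define the free-flow time on its own.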

Number of trips ($v_{odp}$)
To identify the travel flows between each TZ pair during each period, we use the microdata from the OD17 Survey. Although the survey is designed to be statistically representative of the MRSP, two important practical constraints need to be considered when using this data to calculate the weighting of the TTI:

1. The number of TZ pairs observed on UM is not constant; there are days and times when travel time information between a given TZ pair does not exist on the platform because the number of trips made with the Uber application is not enough for the inclusion of the average travel time in the dataset.
2. The Survey includes information about 157,992 trips made on a typical working day; however, there is a total of 6.4 million combinations of TZ pairs × hours (517 TZs × 517 TZs × 24 hours). Therefore, the Survey data is not sufficiently dense to inform about travel flows at the same level of disaggregation as the travel times from UM.

Regarding constraint 1, the main issue is that, with the expansion of Uber's activities, a direct comparison of congestion levels between distinct periods is not necessarily valid. If the TTI is calculated without considering the differences in composition over time, the results of the analysis may be biased. For example, it is possible that the pairs observed in 2016 are mostly in the more central regions of the MRSP, where congestion levels are naturally higher. On the other hand, it is possible that, among the observations from 2018, the proportion of suburban TZ pairs becomes larger. In this case, even if the congestion levels were unchanged in the MRSP, a TTI that does not consider the composition difference between the periods would indicate a drop in congestion over time.
To work around this problem, we keep constant the set of TZ pairs included in the index construction. We selected the pairs observed on at least three quarters of the days in the analyzed period. With this criterion, 23,807 pairs were selected, and they represent 28.8% of motorized trips in the MRSP according to the OD17. Figure 2 shows that although the selected pairs correspond to only 8.9% of the total possible combinations between pairs of zones, they cover practically the entire urbanized area of the MRSP, except for some of the most isolated districts.
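The selection criterion above can be sketched in Python. The sketch assumes that a pair counts as observed on a day if it has at least one hourly record on that date; this reading, the function name `select_stable_pairs` and the sample records are illustrative assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def select_stable_pairs(
    observations: Iterable[Tuple[Hashable, Hashable, Hashable]],
    total_days: int,
    min_share: float = 0.75,
) -> Set[Pair]:
    """Return TZ pairs with data on at least `min_share` of all days.

    Each observation is (origin_tz, destination_tz, date); several hourly
    records on the same date count as a single observed day.
    """
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    days_seen = defaultdict(set)
    for origin, destination, date in observations:
        days_seen[(origin, destination)].add(date)
    return {
        pair
        for pair, days in days_seen.items()
        if len(days) / total_days >= min_share
    }


# Hypothetical four-day sample: pair (1, 2) appears on three days, (1, 3) on two.
sample = [
    (1, 2, "2018-01-01"), (1, 2, "2018-01-01"), (1, 2, "2018-01-02"),
    (1, 2, "2018-01-04"), (1, 3, "2018-01-01"), (1, 3, "2018-01-03"),
]
stable = select_stable_pairs(sample, total_days=4)  # {(1, 2)}
```

Fixing the set of pairs in this way keeps the spatial composition of the index constant, so changes in the TTI over time are not driven by the platform's expanding coverage.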
As for the second constraint, that is, the lack of statistical representativity of the OD17 Survey at the level of TZ pairs by hour, the workaround was to approximate the weighting calculation using a higher level of temporospatial aggregation. Specifically, the MRSP was divided into nine macro-regions, and the weights were calculated based on the number of trips observed in the OD17 between pairs of macro-regions during different periods of the day, rather than the number of trips between each pair of TZs by hour, thus ensuring the statistical representativity of the patterns observed between each pair used for the weighting. The macro-regions used in the study were defined according to Fig. 3.
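One way to compute such aggregated flows is to sum the survey expansion weights by macro-region pair and period of the day, as in the sketch below. It assumes the day periods are those listed for Panel D of Fig. 4 (late night, morning peak, midday, evening peak and night); the function names, the trip tuple layout and the zone-to-region mapping are hypothetical.

```python
from collections import defaultdict


def period_of_day(hour: int) -> str:
    """Map a departure hour (0-23) to a period of the day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 7:
        return "late night"
    if hour < 10:
        return "morning peak"
    if hour < 16:
        return "midday"
    if hour < 19:
        return "evening peak"
    return "night"


def macro_region_flows(trips, macro_of):
    """Sum survey weights by (origin macro-region, destination macro-region, period)."""
    flows = defaultdict(float)
    for origin_tz, destination_tz, hour, weight in trips:
        key = (macro_of[origin_tz], macro_of[destination_tz], period_of_day(hour))
        flows[key] += weight
    return dict(flows)


# Hypothetical survey trips: (origin TZ, destination TZ, departure hour, weight).
macro_of = {1: "Center", 2: "Center", 3: "East"}
survey_trips = [(3, 1, 8, 120.0), (3, 2, 7, 80.0), (1, 3, 18, 150.0)]
flows = macro_region_flows(survey_trips, macro_of)
```

Aggregating to macro-regions and periods pools many survey records into each cell, which is what makes the resulting weights statistically meaningful.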
Finally, to ensure that the proportion of trips between each pair of macro-regions is constant even if the number of TZ pairs changes over time in the UM data, an adjustment factor was added to the TTI formula in order to keep the total weight of each macro-region pair constant:

$$TTI_{RP} = \sum_{o,d \in R} \sum_{p \in P} \frac{v_{\overline{OD}p}}{N_{\overline{OD}p} \, V_{RP}} \left( \frac{t_{odp}}{t^{*}_{od}} - 1 \right) \qquad (3)$$

where $v_{\overline{OD}p}$ is the total number of trips observed in the OD17 Survey between the macro-region pair $\overline{OD}$ during period p, such that o ∈ O and d ∈ D. In addition, $N_{\overline{OD}p}$ is the total number of TZ pairs that make up the macro-region pair $\overline{OD}$ and that have travel time information in period p on the UM platform. This adjustment ensures that even if the number of TZ pairs fluctuates over time, the TTI results will always correspond to the static aggregated traffic volumes observed in the OD17. That is, if a pair of macro-regions contains 10% of the total travel flows in the OD17, then the sum of the weights of the TZ pairs that make up that macro-region pair will always be equal to 10%, and this equivalence will be independent of the number of TZ pairs observed in the UM data on a given day. Therefore, for each macro-region pair and period,

$$\sum_{o \in O,\, d \in D} \frac{v_{\overline{OD}p}}{N_{\overline{OD}p}} = v_{\overline{OD}p}$$

where the sum runs over the $N_{\overline{OD}p}$ TZ pairs observed in period p.
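The effect of this adjustment can be illustrated with a short Python sketch in which each macro-region pair's trips are spread evenly over the TZ pairs observed on a given day. The function names, the two-region layout and the trip counts are hypothetical; the sketch also assumes that each TZ pair appears once in the input and that the total is taken over macro-region pairs with at least one observed TZ pair.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Mapping, Tuple

Pair = Tuple[Hashable, Hashable]


def adjusted_weights(
    observed_pairs: Iterable[Pair],
    macro_of: Mapping[Hashable, str],
    macro_trips: Mapping[Tuple[str, str], float],
) -> Dict[Pair, float]:
    """Spread each macro-region pair's trips evenly over its observed TZ pairs."""
    pairs = list(observed_pairs)
    # N: how many observed TZ pairs fall in each macro-region pair.
    n_observed = Counter((macro_of[o], macro_of[d]) for o, d in pairs)
    # V: total trips over macro-region pairs with observations (sketch assumption).
    total_trips = sum(macro_trips[m] for m in n_observed)
    weights = {}
    for o, d in pairs:
        macro = (macro_of[o], macro_of[d])
        weights[(o, d)] = macro_trips[macro] / (n_observed[macro] * total_trips)
    return weights


def macro_share(weights, macro_of, macro):
    """Total weight of the TZ pairs belonging to one macro-region pair."""
    return sum(w for (o, d), w in weights.items() if (macro_of[o], macro_of[d]) == macro)


# Hypothetical setup: zones 1 and 2 belong to macro-region A, zone 3 to B.
macro_of = {1: "A", 2: "A", 3: "B"}
macro_trips = {("A", "A"): 100.0, ("A", "B"): 900.0}
day_one = adjusted_weights([(1, 2), (2, 1), (1, 3)], macro_of, macro_trips)
day_two = adjusted_weights([(1, 2), (1, 3), (2, 3)], macro_of, macro_trips)
share_one = macro_share(day_one, macro_of, ("A", "A"))
share_two = macro_share(day_two, macro_of, ("A", "A"))
```

On both days the pairs within A-A carry 10% of the total weight, even though two such pairs are observed on the first day and only one on the second.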

TTI descriptive results
Based on the formula described in the previous sub-section, the TTI was calculated for all dates from January 1, 2016 to December 31, 2018. For each day, the index was also calculated for each period of the day and for each macro-region of the MRSP. The average value of the total TTI was 34.88%, indicating that, on average, trips made in São Paulo take 34.88% longer than they would under free-flow conditions.

Fig. 4 (caption, partially recovered): ... and for the whole MRSP. On Panel D, late night corresponds to 12 am-7 am, morning peak to 7 am-10 am, midday to 10 am-4 pm, evening peak to 4 pm-7 pm and night to 7 pm-12 am.

On Panel C, we show the dynamics of congestion within the week. On weekends, the mean TTI is well below average (15.6% on Sundays and 22.2% on Saturdays). On weekdays, there seems to be an increasing pattern, with Friday being the most congested day (46.2%). Panel D shows the dynamics of congestion within the different periods of the day. The evening peak, defined as the period between 4 pm and 7 pm, has the highest average TTI (55.4%).

Total time spent due to congestion
Based on the TTI and the OD17 data, it is possible to estimate with a simple back-of-the-envelope calculation the average time lost due to congestion on a typical weekday in São Paulo. As shown in the previous section, the average TTI on weekdays is equal to 41.2%, and according to the OD17, on a typical business day, 3.62 million individuals travel by car in São Paulo, spending a total of 5.47 million hours on these trips. Therefore, if we divide this total number of hours by one plus the average TTI, we find that those same trips would take only 3.87 million hours if they were always made under free-flow conditions. That is, congestion causes individuals who travel by car in São Paulo to spend, on average, 26.4 more minutes per day than if all trips were performed under free-flow conditions.
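The arithmetic behind this estimate can be retraced in a few lines of Python. The inputs are the rounded figures quoted above, so the result matches the reported delay only up to rounding; the variable names are illustrative.

```python
weekday_tti = 0.412        # average weekday TTI reported in the text
car_travellers = 3.62e6    # individuals travelling by car on a business day (OD17)
total_hours = 5.47e6       # total hours spent on those trips (OD17)

# Hours the same trips would take if always made under free-flow conditions.
free_flow_hours = total_hours / (1.0 + weekday_tti)

# Average daily delay per traveller, converted from hours to minutes.
lost_minutes_per_person = (total_hours - free_flow_hours) / car_travellers * 60.0

print(f"free-flow total: {free_flow_hours / 1e6:.2f} million hours")
print(f"average delay: about {lost_minutes_per_person:.0f} minutes per traveller")
```

Dividing by one plus the TTI follows directly from the index definition: observed time equals free-flow time multiplied by one plus the TTI.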
As pointed out by Litman (2009) [1], expecting all trips to be performed in free-flow conditions is not reasonable, especially in dense urban environments. However, the objective of this type of calculation is not to set a target for potential time savings. Instead, the most important value of this estimation is to translate the potential outcomes of different scenarios into easily understandable metrics. For example, we can calculate that the 6.1 percentage point reduction in the TTI between 2016 and 2017 corresponds to an average reduction of almost 3 minutes per day for all individuals who travel by private car in the MRSP.

Comparison with traditional congestion measures
To further validate the TTI calculated in this paper, we compare it with a traditional congestion metric that is currently calculated by CET, which counts the length of congestion on selected roads by hour. The CET measure is calculated based on the evaluation of technicians in official vehicles or positioned at the top of buildings. Only roads with the largest volume of vehicles are measured, corresponding to just over 800 km of the 17,000 km of the total roadway network in São Paulo [6]. Figure 5 shows the scatter plot of our TTI on the y-axis against the CET congestion measure on the x-axis. To make the values comparable, we restricted the TTI to the same 5 regions that are used in the CET measure, and we also restricted the analysis to business days. Therefore, each point in the plot represents the value of both measures by region of São Paulo city and by day. The figure shows a clear and strong positive correlation between the two measures (0.873). However, it is worth noting that the TTI values are less likely to be equal to or very close to zero, mainly because the road network is not restricted to certain roads as in the case of the CET measure.
Next, we present in Fig. 6 separate plots for each region of the city. The results of these analyses show that most series have a high degree of correlation (above 0.7). The only exception is the North Zone, where the correlation coefficient is equal to 0.499, which can only be considered a moderate positive association between the series. One possible explanation for this difference is the more restricted spatial coverage of the CET indicator in the North Zone.
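A comparison of this kind can be sketched with a plain Pearson correlation computed per region. The sketch assumes that the reported coefficients are Pearson correlations; the function name `pearson`, the region labels and all values in `daily` are hypothetical and are not the study's data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values for two regions: (CET congestion in km, TTI).
daily = {
    "Region 1": ([20.0, 35.0, 50.0, 65.0], [0.30, 0.38, 0.52, 0.60]),
    "Region 2": ([5.0, 8.0, 12.0, 30.0], [0.25, 0.21, 0.33, 0.40]),
}
by_region = {name: pearson(cet, tti) for name, (cet, tti) in daily.items()}
for name, value in by_region.items():
    print(f"{name}: {value:.3f}")
```

Computing the coefficient separately for each region, as in Fig. 6, prevents differences in average congestion between regions from inflating the pooled correlation.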
The high correlation between our TTI and the traditional CET indicator serves as a validation of the measure developed in this study, encouraging its use for analyses that go beyond the space and time limitations of the CET congestion indicator. While the series are similar, it is important to highlight a key difference between the indicators: the CET measurement is based on the level of traffic flow slowness observed on the main avenues of São Paulo, whereas the TTI is calculated from actual travel times of real trips and therefore already accounts for drivers' route optimization and adjustments to real traffic conditions. Thus, in addition to having a simpler and more direct interpretation, the TTI results can be directly translated into travel costs.
In the next section, we explore and present additional and more detailed examples of analyses that can be performed using the TTI results that were calculated in this section.

Analysing the impact of different events on traffic congestion
There are several factors that can affect the level of congestion in a city like São Paulo, including climatic events, strikes, political demonstrations, festivities, historical trends and other atypical events. Given the granularity of the time series of the congestion index, it is possible to estimate and compare the impact of different events on city traffic through a multivariate analysis. By way of illustration, we identified the following factors that could potentially affect São Paulo traffic in the period of our analysis: rainy days; a national truck drivers' strike from May 21 to May 31, 2018; the 2018 FIFA World Cup (especially when the Brazilian national team was playing); school holidays; and the closing of part of Marginal Pinheiros due to the collapse of a bridge (Nov. 15, 2018).
To evaluate the impact of each of these events, we estimate a multivariate regression model of the TTI on indicators for these events, together with day-of-week indicators and year-specific slopes. The results of this model estimation are presented in Fig. 7. During the 2018 World Cup, when the Brazilian national team was not playing, the average congestion was not much different than usual, averaging 1.4 percentage points below expected (a nonsignificant result at the 5% significance level). However, on the days of Brazil's games, the index was on average 11.2 points below usual.
Regular holidays are associated with a congestion reduction of 20.7 percentage points, and holiday bridges have a similar association (17.3 points). During school holidays, congestion is 9.1 points lower. During the truck drivers' strike, the circulation of vehicles was greatly reduced due to the lack of fuel at filling stations. Not surprisingly, the congestion index during the strike was 8.9 points lower than usual. In the days following the closing of Marginal Pinheiros due to a bridge collapse, there was an increase of approximately 4.1 points in the TTI. Finally, on days with average rainfall above 0.1 mm per hour, congestion is on average 4.2 points higher than on non-rainy days.
As already noted in Sect. 3, the congestion patterns are heterogeneous across the days of the week. The multivariate regression estimated here confirms the same patterns. The reference group is Monday. The results indicate that congestion increases on average over the other weekdays, with the highest value observed on Fridays (9.5 points higher than Mondays). On weekends, the TTI is well below the reference group: 16.2 points lower on Saturdays and 22.4 points lower on Sundays. Finally, the year-specific slopes show that while there was a significant decrease of 10 points in 2016, the TTI remained mostly stationary during 2017 and 2018.
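To make the interpretation of such coefficients concrete, the sketch below fits an ordinary least squares regression with event indicators and day-of-week dummies (Monday omitted as the reference group) on synthetic, noise-free data. It is not the specification estimated in this paper: the two events, all effect sizes, the function names and the plain-Python solver are assumptions chosen only to keep the example self-contained.

```python
import datetime as dt


def design_row(date, is_holiday, is_rainy):
    """Intercept, two event indicators and day-of-week dummies (Monday omitted)."""
    weekday = date.weekday()  # Monday is 0, Sunday is 6
    day_dummies = [1.0 if weekday == k else 0.0 for k in range(1, 7)]
    return [1.0, float(is_holiday), float(is_rainy)] + day_dummies


def ols(X, y):
    """Ordinary least squares via Gauss-Jordan elimination on X'X b = X'y."""
    k = len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic, noise-free data: 28 days starting on Monday, January 1, 2018.
# All effect sizes below are invented for the illustration.
TRUE_DAY_EFFECT = [0.0, 2.0, 4.0, 6.0, 9.0, -16.0, -22.0]  # Monday first
start = dt.date(2018, 1, 1)
X, y = [], []
for i in range(28):
    date = start + dt.timedelta(days=i)
    holiday = i in (0, 9, 17)
    rainy = i % 3 == 0
    X.append(design_row(date, holiday, rainy))
    y.append(40.0 - 20.0 * holiday + 4.0 * rainy + TRUE_DAY_EFFECT[date.weekday()])

coefficients = ols(X, y)
holiday_effect, rain_effect, friday_effect = coefficients[1], coefficients[2], coefficients[6]
```

Because Monday is the omitted category, each day-of-week coefficient reads as a difference from Mondays in TTI points, holding the event indicators fixed, which mirrors how the weekday results above are interpreted.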

Conclusion
As in most large metropolises around the world, traffic congestion is one of the greatest problems faced by the residents of São Paulo, generating economic and welfare losses for residents and for visitors. A first step in addressing the problem of traffic congestion is to measure, monitor and understand the phenomenon. However, one difficulty faced by technicians and researchers, particularly in developing world cities, is the lack of large-scale quantitative measures with high temporospatial granularity.
The TTI built in this study based on data from an e-hailing company aims to provide a new tool for congestion analysis that is virtually free and could potentially be extended to other cities and to other types of analysis. The indicator constructed here differs from traditional measures used in São Paulo by being based on actual travel times from real trips rather than on road-based metrics. Because of that, the indicator directly reflects travel time costs while accounting for route adaptation and optimization. Additionally, the index created here suggests a framework for integrating UM data with household travel surveys in order to create a weighting scheme that makes the index results reflect the travel patterns and average delays experienced by drivers.
Still, the framework proposed in this paper is open to technical improvements and adjustments, such as refinement of the free-flow metrics and of the travel weighting. A natural extension of the framework presented here would be to go beyond a traditional averaged Travel Time Index and to use the integrated data to calculate reliability measures such as a Planning Time Index. We also acknowledge that the multivariate analysis estimated here does not fully take advantage of the temporospatial granularity of the index. Therefore, the results presented here represent an initial outline of the possible applications of the index, and other studies using and improving the indicator, as well as exploring the Uber Movement data, are highly recommended. It is hoped that such tools and studies will be used to objectively inform the urban mobility debate and to assist policy makers in their decisions.