Introducing the eqasim pipeline: From raw data to agent-based transport simulation

This paper introduces the eqasim framework, which aims to provide a consistent pipeline from raw data to a final transport simulation. It thereby lays the foundation for fully reproducible agent-based transport simulations. While the pathway from raw data to a generic synthetic travel demand was covered previously for specific use cases, the general methodology is summarized here. Furthermore, the tools and methods for combining MATSim simulations with flexibly definable discrete choice models are described; they form the core of the existing eqasim simulation implementations for Île-de-France, Switzerland, São Paulo and California.


Introduction
Agent-based models have found their way into the field of transportation. Their ability to model interactions of travelers with one another and with the environment, together with the rising computational power of modern computers, has led to the emergence of several agent-based transport simulation frameworks such as MATSim [11], SUMO [15] and others.
Initial work on agent-based modeling in transportation focused mainly on showcasing the importance and benefits of these models, while constantly improving their capabilities. However, little effort was put into documenting the conducted studies and ensuring their reproducibility.
Each case study conducted with an agent-based model mainly relies on two fundamental blocks: (1) synthetic travel demand and transport supply as input data, and (2) the simulation environment. The simulation environment can be made accessible to others by publishing the code open-source, providing good documentation and maintaining version control. This ensures the minimum requirements for any of the studies to be reproducible. However, access to the synthetic travel demand and the tools to reproduce it are also essential parts leading to the repeatability of agent-based modeling studies.
To foster reproducibility of agent-based studies, we propose a general pipeline called eqasim that provides a clear path from raw data to a final agent-based mobility simulation.
The paper is structured as follows. Section 2 gives a brief overview of agent-based simulation and models. Section 3 provides more information on the eqasim approach. Section 4 showcases one implementation of the pipeline with a model for Zurich, Switzerland. Section 5 provides discussion and concluding remarks.

Background
Agent-based models in transportation emerged as the need to model individual travelers and their interaction with the environment grew. Their importance became evident when traditional transport models struggled to evaluate the complex interactions between users and shared mobility services, e.g., car-sharing. While car-sharing was probably one of the first emerging transport services to be investigated in detail using agent-based models [16,4], shared on-demand automated services ensured that agent-based models became a tool of choice for this kind of transportation service (e.g., [13,6]).
The significance of agent-based models led several research groups to develop agent-based modeling frameworks (MATSim [11], POLARIS [2], SimMobility [3], TRANSIMS [17], SUMO [15]). Hundreds of studies conducted with these models have given insights into many relevant topics: equity, policy, disease spreading, welfare analysis, and many others. However, the complexity of these models makes them hard and time-consuming to set up, which is one of their important limitations [14]. Once the model is set up, studies can be conducted. Yet these studies are almost always decoupled from the process that generates the necessary input, which makes them difficult for other researchers to repeat. This prevents the reproducibility of the majority of agent-based modeling studies, which is the second important limitation identified in the current literature on agent-based models [14].
Hence, there is a need to provide the tools that allow users to replicate and repeat research on agent-based studies. We propose a framework called eqasim that takes raw data, creates a synthetic travel demand data set, allows the user to convert the data into an agent-based model (currently MATSim) and to simulate traveler behavior.

From raw data to an agent-based simulation
The guiding principle of the eqasim methodology is that scientific research should be easily accessible and repeatable. We aim to provide a clear path from raw data to the final agent-based simulation that is easily extendable, modifiable, and verifiable. Ideally, all elements on this path should be published as open-source code, and the data should be open and publicly available as well. This way, anybody interested could gather the publicly available data, run the code, and reproduce a synthetic travel demand and mobility scenario that has been used in research elsewhere.
Such a process has many advantages. First, research becomes reproducible and results can be verified. While this should be the standard, it is often not possible when it comes to agent-based transport simulation. The same applies to applied planning projects, which could be performed in a more transparent way if the entire process of setting up the required simulations were open.
The path from the raw data to the mobility simulation (here referred to as the pipeline) is divided into three parts (see Figure 1).

[Figure 1. Labels: Raw data (Census, HTS, ...); Open data; Simulation & Analysis]
The backbone of the pipeline is synpp¹, a generic Python package for chaining algorithms and code pieces (stages) in a larger pipeline setup. While it can be used in a more general context, it aims to provide a solid basis for travel demand synthesis and transport simulation applications.
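The idea of chaining stages can be illustrated with a toy runner. This is a minimal sketch of the general concept only and does not use or reproduce the synpp API: the class `Pipeline`, its methods, and the stage names are hypothetical and chosen for illustration. Each stage declares the stages it depends on, and results are cached so that every stage runs at most once.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class Pipeline:
    """Toy stage runner (hypothetical, not synpp): resolves dependencies and caches results."""

    def __init__(self) -> None:
        self._stages: Dict[str, Tuple[Callable, Tuple[str, ...]]] = {}
        self._cache: Dict[str, object] = {}
        self.run_log: List[str] = []  # order in which stages were actually executed

    def register(self, name: str, func: Callable, requires: Sequence[str] = ()) -> None:
        self._stages[name] = (func, tuple(requires))

    def run(self, name: str) -> object:
        # Return the cached result if this stage has already been executed.
        if name in self._cache:
            return self._cache[name]
        func, requires = self._stages[name]
        # Execute all upstream stages first and pass their results on.
        inputs = [self.run(dependency) for dependency in requires]
        result = func(*inputs)
        self._cache[name] = result
        self.run_log.append(name)
        return result


pipeline = Pipeline()
# Illustrative stages: load raw records, clean them, derive a summary.
pipeline.register("raw_census", lambda: [{"age": 34}, {"age": 71}, {"age": -1}])
pipeline.register(
    "cleaned_census",
    lambda rows: [row for row in rows if row["age"] >= 0],
    requires=["raw_census"],
)
pipeline.register("population_size", lambda rows: len(rows), requires=["cleaned_census"])

population_size = pipeline.run("population_size")
```

Requesting only the final stage is enough: the runner walks back through the declared dependencies, which mirrors how a pipeline from raw data to a final result can be reproduced from a single entry point.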
The Population Pipeline part of the pipeline takes raw data, performs all the necessary filtering and cleaning steps, and produces the synthetic population with activity patterns and activity locations. This process is ideally based on openly available datasets published by government agencies: household travel survey, population census, enterprise census, commuting flows, etc. An example of generating a synthetic travel demand based entirely on open data was previously described in [12], where the Île-de-France region in France is taken as the case study. That paper describes the complete process, including all methods used, and points to the open-source code that can be used to reproduce the synthetic travel demand.
The output of this step contains the following data sets:
• households.csv, which contains all households in the study region characterized by certain attributes, e.g., household size, household type, car and/or bike ownership, household income
• persons.csv, which contains all individuals living in the study area characterized by further attributes, e.g., age, gender, public transport season ticket ownership
• trips.csv and trips.gpkg, which contain information on all trips conducted by the individuals in plain and geolocalized form
• activities.csv and activities.gpkg, which contain information on all activities performed by individuals during an average day in plain and geolocalized form.
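As a hedged illustration of how such outputs might be consumed, the sketch below reads household and person tables and checks that each household's declared size matches its number of members. The column names (`household_id`, `person_id`, `household_size`), the comma delimiter, and the inline sample data are assumptions made for this example and are not taken from the pipeline's actual output schema.

```python
import csv
import io
from collections import defaultdict

# Inline stand-ins for households.csv and persons.csv (hypothetical columns and values).
HOUSEHOLDS_CSV = """household_id,household_size,car_availability
1,2,some
2,1,none
"""

PERSONS_CSV = """person_id,household_id,age
10,1,34
11,1,36
12,2,71
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def persons_per_household(persons):
    """Group person ids by the household they belong to."""
    grouped = defaultdict(list)
    for person in persons:
        grouped[person["household_id"]].append(person["person_id"])
    return dict(grouped)


def inconsistent_households(households, persons):
    """Return ids of households whose declared size differs from their member count."""
    members = persons_per_household(persons)
    return [
        household["household_id"]
        for household in households
        if int(household["household_size"]) != len(members.get(household["household_id"], []))
    ]


households = read_rows(HOUSEHOLDS_CSV)
persons = read_rows(PERSONS_CSV)
inconsistent = inconsistent_households(households, persons)
```

With real files, `read_rows` would be replaced by opening the file on disk; the consistency check itself stays the same.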
The second step in the process is to convert the synthetic travel demand to the right format for use in an agent-based model. Currently the main implementation is a converter for MATSim [11], but first experiments with converting the data into input for SUMO have been conducted.
The third and final step is running the mobility simulation, which is presented in more detail in the following. eqasim provides the environment to easily integrate the converted travel demand from the second part of the pipeline into the MATSim simulation, and we add further extensions to increase compatibility with other existing concepts and methods in transport planning.
MATSim simulations normally consist of three phases: mobility simulation, replanning and scoring. The recently added Discrete Mode Choice (DMC)² [7,8] module, however, makes it possible to replace the scoring procedure completely with discrete mode choice models that are executed during the replanning stage. To date, all models developed with the eqasim framework make use of this component to simulate mode choice decisions. To ease setting up a simulation with a discrete mode choice model, the eqasim-java package has been developed, which is available as open source³. Besides easy-to-use interfaces and tools to set up highly customizable discrete choice models in MATSim, it provides further utilities for routing trips, cutting scenarios, and more, which are compatible with and used in the overall eqasim pipeline.
The main components of the eqasim-java package are (Figure 2):
• Variables: Each alternative and person can have specific attributes stored that can later be used by utility estimators (e.g., mobility tool ownership, household income, ...).
• Estimators: Each mode alternative has an estimator, which calculates its utility based on the parameters, cost model, predicted trip characteristics, and person attributes. Default estimators are available for the modes mentioned above.
The estimators quantify the utility of each mode choice alternative that is defined in the discrete mode choice extension. The extension then performs the mode choice, usually using the calculated utilities as part of a multinomial logit model. However, other formulations, such as a nested logit model, are available. The discrete choice model itself is called in the replanning phase of MATSim and applied to a randomly selected, configurable share of agents in each iteration.
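The selection step described above can be sketched as follows. This is a minimal Python illustration of multinomial logit sampling applied to a random share of agents; it is not the eqasim-java or DMC implementation, and the function names and data layout are hypothetical.

```python
import math
import random


def logit_probabilities(utilities):
    """Multinomial logit: P(m) = exp(u_m) / sum_k exp(u_k)."""
    # Subtracting the maximum utility avoids overflow without changing the result.
    max_utility = max(utilities.values())
    exponentials = {mode: math.exp(u - max_utility) for mode, u in utilities.items()}
    total = sum(exponentials.values())
    return {mode: value / total for mode, value in exponentials.items()}


def choose_mode(utilities, rng):
    """Draw one mode at random according to the logit probabilities."""
    probabilities = logit_probabilities(utilities)
    modes = list(probabilities)
    weights = [probabilities[mode] for mode in modes]
    return rng.choices(modes, weights=weights, k=1)[0]


def replan(current_modes, utilities_by_agent, replanning_share, rng):
    """Re-draw the mode for a random share of agents; all others keep their mode."""
    updated = dict(current_modes)
    for agent_id in current_modes:
        if rng.random() < replanning_share:
            updated[agent_id] = choose_mode(utilities_by_agent[agent_id], rng)
    return updated


rng = random.Random(0)
probabilities = logit_probabilities({"car": -1.0, "pt": -1.0, "walk": -3.0})
```

In this sketch, `replanning_share` plays the role of the configurable share of agents that reconsider their decision in each iteration.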

An example: the Zurich scenario
In the following, an example of using the pipeline for a transport simulation of Zurich, Switzerland, is documented. The synthetic travel demand for the study was first used in [10], with an updated version published in [9], where the relevant data sets are documented. The complete list of data sets and the code to generate the synthetic travel demand are also available online⁴. The pipeline makes use of census data, a detailed household structure survey, a household travel survey and additional data sets. Unfortunately, contrary to the pipeline implementation for Paris and Île-de-France [12], not all data sets are publicly available as open data. For that reason, only Swiss research institutions can currently reproduce the synthetic travel demand completely.
The transport supply part of the model is generated using the pt2matsim [18] converter. OpenStreetMap data is used to generate the road network, and GTFS schedules are used to generate the public transport services. Finally, pt2matsim is used to map the generated transit schedules to the road network.

Mode choice model
The mode choice model is a multinomial logit model, which was estimated using the Swiss household travel survey and a specific travel survey for the Zurich region. The estimated choice parameters are defined inside the Mode Parameters component of the eqasim structure (Figure 2). The utility functions of the model are defined in the Estimators component. Most inputs to the choice model are static: they are based on trip characteristics or traveler attributes and do not change over the iterations. The major dynamic influence is the travel time by car, as it depends directly on the traffic assignment resulting from the detailed vehicle simulation in MATSim. The following equations define the utility functions for the modes car, bicycle, walking, and public transport:
\begin{align*}
u_{\text{car}} ={}& \beta_{\text{ASC,car}} + \beta_{\text{inVehicleTime,car}} \cdot \xi_{\text{TD}} \cdot x_{\text{inVehicleTime,car}} \\
& + \beta_{\text{work,car}} \cdot x_{\text{work}} + \beta_{\text{city,car}} \cdot x_{\text{city}} \\
& + \beta_{\text{cost}} \cdot \xi_{\text{CD}} \cdot \xi_{\text{CI}} \cdot x_{\text{cost,car}} \\
u_{\text{bicycle}} ={}& \beta_{\text{ASC,bicycle}} + \beta_{\text{travelTime,bicycle}} \cdot \xi_{\text{TD}} \cdot x_{\text{travelTime,bicycle}} \\
& + \beta_{\text{highAge,bicycle}} \cdot [a_{\text{age}} \geq 60] \\
u_{\text{walk}} ={}& \beta_{\text{ASC,walk}} + \beta_{\text{travelTime,walk}} \cdot \xi_{\text{TD}} \cdot x_{\text{travelTime,walk}} \\
u_{\text{pt}} ={}& \beta_{\text{ASC,pt}} + \beta_{\text{inVehicleTime,train}} \cdot \xi_{\text{TD}} \cdot x_{\text{inVehicleTime,train}} \\
& + \beta_{\text{inVehicleTime,other}} \cdot \xi_{\text{TD}} \cdot x_{\text{inVehicleTime,other}} \\
& + \beta_{\text{inVehicleTime,feeder}} \cdot x_{\text{inVehicleTime,feeder}} \\
& + \beta_{\text{transferTime,pt}} \cdot x_{\text{transferTime,pt}} + \beta_{\text{accessEgressTime,pt}} \cdot x_{\text{accessEgressTime,pt}} \\
& + \beta_{\text{headway,pt}} \cdot x_{\text{headway,pt}} + \sum_{G} \beta_{\text{ptQuality},G} \cdot x_{\text{ptQuality},G} \\
& + \beta_{\text{cost}} \cdot \xi_{\text{CD}} \cdot \xi_{\text{CI}} \cdot x_{\text{cost,pt}}
\end{align*}
All β represent estimated model parameters, x represent per-trip input variables, and a represent per-agent variables. The utility function for public transport distinguishes routes that include a train: x_inVehicleTime,train quantifies the travel time in the main stage and x_inVehicleTime,feeder quantifies the rest. Only if no train is included in the route does x_inVehicleTime,other have a value, while the other two are set to zero.
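To make the structure of these utility functions concrete, the following sketch evaluates the car and bicycle utilities in Python. It is an illustration under stated assumptions, not the eqasim-java estimators: all parameter values are placeholders rather than the estimated values from Table 1, travel times are assumed to be in minutes, and the interaction terms ξ are passed in as plain numbers (defaulting to 1) because their functional form is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class CarParameters:
    # Placeholder coefficients for illustration only (not the estimated values).
    asc: float
    beta_in_vehicle_time: float
    beta_work: float
    beta_city: float
    beta_cost: float


@dataclass
class BicycleParameters:
    # Placeholder coefficients for illustration only (not the estimated values).
    asc: float
    beta_travel_time: float
    beta_high_age: float


def car_utility(p, in_vehicle_time, is_work_trip, is_city_trip, cost,
                xi_td=1.0, xi_cd=1.0, xi_ci=1.0):
    """u_car: constant + time term + work/city dummies + cost term."""
    return (
        p.asc
        + p.beta_in_vehicle_time * xi_td * in_vehicle_time
        + p.beta_work * (1.0 if is_work_trip else 0.0)
        + p.beta_city * (1.0 if is_city_trip else 0.0)
        + p.beta_cost * xi_cd * xi_ci * cost
    )


def bicycle_utility(p, travel_time, age, xi_td=1.0):
    """u_bicycle: constant + time term + indicator for agents aged 60 or older."""
    high_age = 1.0 if age >= 60 else 0.0
    return p.asc + p.beta_travel_time * xi_td * travel_time + p.beta_high_age * high_age


car = CarParameters(asc=0.5, beta_in_vehicle_time=-0.03, beta_work=-1.0,
                    beta_city=-0.5, beta_cost=-0.1)
bicycle = BicycleParameters(asc=-0.1, beta_travel_time=-0.05, beta_high_age=-2.0)

u_car = car_utility(car, in_vehicle_time=10.0, is_work_trip=True,
                    is_city_trip=False, cost=2.6)
u_bicycle_senior = bicycle_utility(bicycle, travel_time=20.0, age=65)
u_bicycle_adult = bicycle_utility(bicycle, travel_time=20.0, age=30)
```

The walking and public transport utilities follow the same pattern, with one additive term per product in the corresponding equation.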
The pt quality parameters and variables are based on a methodology of the Federal Office for Spatial Development quantifying the accessibility by public transport of any location in Switzerland, based on vicinity to public transport infrastructure and frequencies of the accessible transport lines [1].
The model includes two interaction terms ξ that establish non-linear dependencies of the utility of travel time and of cost on distance, and one interaction term that relates household income to the perception of cost. The estimated mode parameters are documented in Table 1.
The Cost Model for the private car alternative calculates the cost of car travel as 0.26 CHF/km based on the routed distance. The Cost Model for public transport defines the cost based on subscription ownership. Fares for public transport are zero if the agent has an annual subscription ("Generalabo"), which is very common in Switzerland; they are also zero if the agent has a regional subscription and the origin and destination of the trip are within 15 km Euclidean distance of the agent's home location (model assumption). Otherwise, the fare is calculated as 0.5 CHF/km based on the routed distance inside a public transport vehicle (thus excluding access and egress walks). If the agent has a "Halbtax" half-fare subscription, the fare is reduced by half.
Figure 3 shows the results in terms of mode shares by Euclidean distance for the choice model after it was integrated directly into MATSim (in light gray). As usual with this procedure, the resulting shares do not fit perfectly with the reference data obtained from the household travel survey. For that reason, two adjustments were needed.
First, the survey used to estimate the choice model was not representative for shorter distances, so the fit for the walking mode is poor. The utility for the walking mode was therefore modified with an additional penalty term, which is close to zero for very short trips and becomes strongly negative (−100) once a certain threshold travel time is reached.
Second, the alternative-specific constant for the car mode was adjusted. This is usually necessary because the interpretation of travel time differs between the survey data and the simulation. Generally, the alternative-specific constant captures all components of the trip utility that are not described explicitly by the other terms. Hence, it can include the discomfort of searching for parking, paying for parking, and, in the specific case of this simulation, access to and egress from the vehicle.
Although recent versions of MATSim support simulating access and egress stages to and from the vehicle, this feature was not used here. Even with it, the model would not include the locations of parking spots or garages, nor any additional model of choosing between on-street parking, a large garage, or parking facilities provided by the company for work trips. Such considerations will be important for the future development of the framework. To compensate for such effects, β ASC,car was set to −0.8 after several steps of manual calibration. Finally, Figure 3 shows the fit of the calibrated model to the reference data based on the national travel survey (in black). While the reference data itself is noisy, the simulated shares closely follow the trend observed in the data.
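The cost rules for car and public transport stated earlier in this section can be written down compactly. The following Python sketch encodes them as described in the text; the function and argument names are hypothetical, and treating "within 15 km" as an inclusive bound is an assumption.

```python
CAR_COST_PER_KM = 0.26      # CHF per routed km (from the text)
PT_FARE_PER_KM = 0.5        # CHF per routed in-vehicle km (from the text)
REGIONAL_RADIUS_KM = 15.0   # Euclidean distance from home (model assumption in the text)


def car_cost(routed_distance_km):
    """Cost of a car trip in CHF, based on the routed distance."""
    return CAR_COST_PER_KM * routed_distance_km


def pt_cost(in_vehicle_distance_km,
            has_general_subscription,
            has_regional_subscription,
            has_half_fare,
            origin_home_distance_km,
            destination_home_distance_km):
    """Fare of a public transport trip in CHF, depending on subscriptions."""
    # Annual "Generalabo" subscription: travel is free.
    if has_general_subscription:
        return 0.0
    # Regional subscription: free if origin and destination are both close to home.
    # Assumption: the 15 km bound is treated as inclusive.
    if (has_regional_subscription
            and origin_home_distance_km <= REGIONAL_RADIUS_KM
            and destination_home_distance_km <= REGIONAL_RADIUS_KM):
        return 0.0
    # Otherwise: distance-based fare on the in-vehicle distance only.
    fare = PT_FARE_PER_KM * in_vehicle_distance_km
    # "Halbtax" half-fare subscription halves the fare.
    if has_half_fare:
        fare *= 0.5
    return fare


example_car = car_cost(10.0)
example_half_fare = pt_cost(20.0, False, False, True, 5.0, 30.0)
```

The resulting costs enter the utility functions through the terms x_cost,car and x_cost,pt.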

Conclusion
This paper presented the eqasim pipeline, which takes raw data and leads to an agent-based simulation after a number of sequential processing steps. The Switzerland model is used as an example. The proposed framework, which has already been applied successfully to other regions, namely California [5], São Paulo [19], and France [12], enables reproducible agent-based transportation studies. Even though the framework itself does not guarantee the reproducibility of downstream studies, it provides users with the necessary tools to achieve it.
The framework is modular by design, which allows each of the stages to be replaced. Different methods can be used in travel demand synthesis. Different converters can be implemented to convert the demand to the input format of other agent-based models. Ultimately, different agent-based models can be employed for the final simulation studies, individually or in combination and comparison.