Open data from taxis and Bluetooth detectors to extract congestion and mobility patterns in Thessaloniki

Thessaloniki hosts one of the largest mobility living labs in Europe, which aims to foster innovation in the mobility sector. Data is a key aspect of the living lab: it makes it possible to depict mobility and congestion patterns in order to manage traffic better and support decision making. Most of the public and private stakeholders of the Thessaloniki mobility ecosystem are part of the living lab and provide real-time data to its host (CERTH-HIT), receiving added-value services in return. Structured and unstructured transportation and mobility datasets, generated by both conventional and innovative data sources, namely floating taxis and Bluetooth detectors, are processed in “Thessaloniki's Smart Mobility Living Lab”, the data analysis and modelling laboratory of the Hellenic Institute of Transport (HIT). As most datasets are generated by machines at a high rate and density, an efficient back-end pipeline is in place to ensure the proper collection, transformation, combination, and processing of such datasets in almost real time. In addition, many static datasets are kept and updated regularly.


Specifications
Subject: Business, Management and Decision Sciences
Specific subject area: Transportation management using multi-source data to analyse the current status of the network and provide services to citizens and stakeholders.
Type of data: Real-time and historical datasets in JSON, XML, CSV, KML and MAP formats, containing speed measurements and trip trajectories that provide the sequence of locations, or the origin and destination, together with the relevant timestamps.
How the data were acquired: The data described were acquired:
• via proprietary APIs that transmit data captured by the Bluetooth sensors installed at various traffic intersections throughout the city. The (anonymized) captured data are transferred via the UDP protocol using the mobile network infrastructure (GPRS).
• via access to APIs provided by the collaborating taxi association, which provide the location data of the floating taxis, captured using the on-board units (OBUs, Android tablets).

Value of the Data
• Floating car data from taxis, containing speed and map-matched location, and Bluetooth detection data are used to extract congestion patterns. These support traffic management in the city by serving as an umbrella over the three technology providers, allowing the city to coordinate their efforts.
• Floating car data containing location and vehicle status (with customer or not) and Bluetooth detection data are used to generate mobility patterns, which feed transport modelling and support the city in decision making related to mobility (e.g. they were used to define the network of micro-mobility stations in the city, including location and size but also expected rebalancing needs).
• All these datasets are used by operations research (OR) and AI algorithms to optimize mobility systems (e.g. location of stations, areas for on-demand services).

Data Description
The datasets contained in the repository are described below. Table 1 summarises the datasets and their data fields.

fcd-gps
This dataset contains anonymized floating car data collected from a taxi fleet of over one thousand vehicles that operate in the city of Thessaloniki. The information available from this dataset is the location of the vehicle (longitude, latitude), the speed, the altitude, the orientation, and the timestamp at which the data were recorded. This dataset is available in JSON, XML, CSV, KML and MAP formats and is updated in almost real time. A historical dataset, updated every month, is also available in TXT format (tab-delimited text file).
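As an illustration of how a consumer might read these records, the following minimal Python sketch parses a JSON payload into typed records. The key names (`lon`, `lat`, `speed`, `altitude`, `orientation`, `timestamp`), the ISO timestamp format and the sample values are assumptions made for the example; the actual schema should be taken from the repository.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FcdRecord:
    """One floating car data point (field names are assumed, not official)."""
    lon: float
    lat: float
    speed: float
    altitude: float
    orientation: float
    timestamp: datetime


def parse_fcd(payload: str) -> list[FcdRecord]:
    """Parse a JSON array of FCD objects into FcdRecord instances."""
    records = []
    for item in json.loads(payload):
        records.append(FcdRecord(
            lon=float(item["lon"]),
            lat=float(item["lat"]),
            speed=float(item["speed"]),
            altitude=float(item["altitude"]),
            orientation=float(item["orientation"]),
            timestamp=datetime.fromisoformat(item["timestamp"]),
        ))
    return records


# Made-up sample record, used only to exercise the parser.
sample = (
    '[{"lon": 22.9444, "lat": 40.6401, "speed": 31.0, "altitude": 12.0, '
    '"orientation": 270.0, "timestamp": "2021-03-01T08:15:10"}]'
)
fcd_records = parse_fcd(sample)
```

Converting to typed records early makes later filtering steps (for example on speed or coordinates) less error-prone than working with raw dictionaries.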

itravel-detections
This dataset stores the aggregated number of detections per iTravel Bluetooth device. The fields included are the device id (which corresponds to the id in the itravel-devices dataset described below), the number of records and the timestamp. A historical dataset (itravel-detections-historical) in TXT format is also available and updated monthly.

network-speed
This dataset contains the current speeds of the road network in Thessaloniki as calculated using the fcd-gps dataset. OpenStreetMap (OSM) road segments are used for the identification of the segments. The dataset contains the Link_id (OSM road segment), Link_direction (the direction of the road segment), timestamp, speed (calculated) and uniqueEntries (number of FCD records used for the calculation). This dataset is available as JSON, XML or CSV and is updated every 15 min. Lastly, a historical dataset (network-speed-historical) is also available in TXT format (tab-delimited text file) and updated monthly.
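A consumer of this dataset may want to ignore speed estimates that rest on very few observations. The sketch below reads the CSV form and keeps only rows whose uniqueEntries value reaches a minimum. It assumes a comma-delimited file with a header row using the field names listed above; the `min_entries` threshold, the timestamp format and the sample rows are hypothetical.

```python
import csv
import io


def reliable_speeds(csv_text: str, min_entries: int = 3) -> dict[tuple[str, str], float]:
    """Return {(Link_id, Link_direction): speed} for rows backed by enough FCD."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["uniqueEntries"]) >= min_entries:
            result[(row["Link_id"], row["Link_direction"])] = float(row["speed"])
    return result


# Made-up rows, used only to exercise the function.
sample_csv = """Link_id,Link_direction,timestamp,speed,uniqueEntries
1001,1,2021-03-01 08:15:00,27.5,6
1002,2,2021-03-01 08:15:00,41.0,1
"""
speeds = reliable_speeds(sample_csv)
```

Keying the result by both Link_id and Link_direction keeps the two directions of the same OSM segment separate.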

network-congestion
Like the network-speed dataset, this dataset is calculated using the fcd-gps dataset. As the name suggests, it depicts the congestion in the road network of Thessaloniki. It contains the fields Link_id (OSM road segment), Link_Direction (direction of the road segment), Timestamp and Congestion (the level of congestion: low, medium or high). It is available in JSON, XML or CSV format and is updated every 15 min.

itravel-traveltimes
This dataset is produced using the Bluetooth detections (itravel-detections dataset). It is updated in almost real time and contains the path id (which corresponds to the id in the itravel-paths dataset below), the timestamp and the duration. A historical dataset (itravel-traveltimes-historical) is also available in TXT format and updated monthly.

fcd-traveltimes
This dataset is similar to the itravel-traveltimes dataset described above. In this case the FCD are used to calculate the travel time for the predefined paths. It is available in JSON, XML and CSV format alongside the historical dataset (fcd-traveltimes-historical).

datafusion-traveltimes
This dataset is produced using both the iTravel Bluetooth detections and FCD data to calculate the current travel times for the selected paths. It is available in JSON, XML and CSV format. A historical dataset (datafusion-traveltimes-historical) is available in TXT format and updated monthly.

itravel-devices
This dataset holds the static data of the Bluetooth detection devices' characteristics. Namely, the fields contained in the dataset are the device id, the device name, and the location (longitude, latitude).

itravel-paths
This is a static dataset of the geolocation of the predefined paths in Thessaloniki between the iTravel Bluetooth devices. It contains the path id, the path name, the ids of the origin and destination Bluetooth devices and the coordinates of the path as a polyline. It is available in XML, JSON, CSV, KML and MAP formats.
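Because each path refers to its origin and destination devices by id, the two static datasets can be joined. The sketch below resolves a path's endpoints to device coordinates and computes the straight-line (great-circle) distance between them, which is a lower bound on the path length and can help sanity-check reported durations. The dictionary keys and sample coordinates are hypothetical; the polyline itself is not decoded here because its encoding is not specified in this section.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two (lon, lat) points."""
    r = 6_371_000.0  # mean Earth radius in metres
    dlon = radians(lon2 - lon1)
    dlat = radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def endpoint_distance_m(path: dict, devices: dict) -> float:
    """Straight-line distance between a path's origin and destination devices."""
    o_lon, o_lat = devices[path["origin_id"]]
    d_lon, d_lat = devices[path["destination_id"]]
    return haversine_m(o_lon, o_lat, d_lon, d_lat)


# Hypothetical devices {device id: (longitude, latitude)} and one hypothetical path.
devices = {1: (22.9400, 40.6300), 2: (22.9500, 40.6300)}
paths = [{"path_id": 10, "origin_id": 1, "destination_id": 2}]
dist = endpoint_distance_m(paths[0], devices)
```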

Experimental Design, Materials and Methods
The data described in the previous section were collected through various sources, namely detectors placed in strategic locations and floating cars (specifically taxis) that traverse the city of Thessaloniki. The methodology of assessing and cataloguing these datasets is described in [1,2]. Furthermore, relevant urban mobility indicators can be generated as presented by Mistakis et al. in [3]. All the open datasets provided in the repository are anonymized and strictly follow the GDPR guidelines.

Floating Car Data
This dataset (fcd-gps) contains anonymized floating car data collected from the biggest taxi fleet association (Taxiway), consisting of over one thousand vehicles that operate in the city of Thessaloniki. Every moving vehicle produces and transmits one GPS record every 10-12 s or every 100 m of movement. On average, nearly 2 thousand new FCD records are generated per minute, and the dataset is updated in almost real time.
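The figures above allow a rough back-of-the-envelope check. The short calculation below considers only the time-based trigger (one record every 10-12 s) and treats the quoted rate as constant over the day; both are simplifying assumptions. Since the 100 m trigger can add records for fast-moving vehicles, the derived number of simultaneously transmitting vehicles should be read as an upper estimate, not a reported figure.

```python
records_per_minute = 2000   # approximate average quoted in the text
interval_s = (10, 12)       # reporting interval range quoted in the text

# Records produced per vehicle per minute under the time-based trigger only.
per_vehicle_per_min = tuple(60 / s for s in interval_s)           # (6.0, 5.0)

# Implied number of vehicles transmitting at the same time (upper estimate).
vehicles = tuple(records_per_minute / r for r in per_vehicle_per_min)

# Implied daily volume if the average rate held for 24 hours.
records_per_day = records_per_minute * 60 * 24

print(f"vehicles transmitting concurrently: about {vehicles[0]:.0f} to {vehicles[1]:.0f}")
print(f"records per day: about {records_per_day:,}")
```

Under these assumptions roughly a third of the fleet of over one thousand vehicles would be moving and transmitting at a typical moment, and the daily volume would be close to three million records.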

Bluetooth Detections
Over 43 Bluetooth detection devices are placed at the major road junctions of the city of Thessaloniki. The itravel-detections dataset holds the aggregated detections of each Bluetooth device within a specific timeframe. These data are derived from detection records of the MAC addresses of devices that are in range of the detector and in discoverable mode. The captured data are subsequently anonymized and transmitted to the server via the UDP protocol using the mobile network infrastructure (GPRS).
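The anonymize-then-aggregate flow described above can be sketched as follows. This is not the authors' scheme: the salted SHA-256 hash, the truncation to 16 hex characters, the 15-minute window and the sample detections are all assumptions chosen for illustration. The `window_start` helper only works for window lengths that divide 60 evenly.

```python
import hashlib
from collections import Counter
from datetime import datetime


def anonymize(mac: str, salt: bytes) -> str:
    """One-way salted hash of a MAC address (illustrative scheme only)."""
    return hashlib.sha256(salt + mac.lower().encode("ascii")).hexdigest()[:16]


def window_start(ts: datetime, minutes: int) -> datetime:
    """Floor a timestamp to the start of its aggregation window."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def aggregate(detections, minutes: int = 15) -> Counter:
    """Count detection records per (device_id, window start)."""
    counts = Counter()
    for device_id, _hashed_mac, ts in detections:
        counts[(device_id, window_start(ts, minutes))] += 1
    return counts


# Made-up raw detections: (detector id, MAC address, detection time).
salt = b"example-salt"
raw = [
    (7, "AA:BB:CC:00:11:22", datetime(2021, 3, 1, 8, 3, 10)),
    (7, "AA:BB:CC:00:11:33", datetime(2021, 3, 1, 8, 14, 59)),
    (7, "AA:BB:CC:00:11:22", datetime(2021, 3, 1, 8, 15, 0)),
]
detections = [(dev, anonymize(mac, salt), ts) for dev, mac, ts in raw]
detection_counts = aggregate(detections)
```

Hashing before any storage or transmission means that downstream aggregation never handles the original MAC addresses.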

Network Status
By applying a series of calculations and state-of-the-art methodologies to the FCD and Bluetooth detection datasets, valuable information on the status of the network can be produced, namely the average speeds and congestion. The network-speed dataset provides accurate, real-time traffic information for Thessaloniki, Greece, by estimating the average moving speed of the vehicles on the road network. The speed estimations are produced every 15 min, although this frequency may change in the future. Initially, all FCD records are processed to remove any data that would allow unauthorized or voluntary user identification, and are filtered to remove erroneous entries with extraordinary speeds or unaligned coordinates, as well as entries generated by faulty GPS receivers. Then, using a specific algorithm that considers the network topology (monographs, types of roads, etc.), each record (point) is mapped to the part of the road network to which it most likely belongs [4]. Lastly, for each section (segment) of the road network, statistical analysis is performed to provide a safe estimate of the average speed of vehicle traffic on it. Similarly, the network-congestion dataset provides traffic information in a qualitative manner (low, medium, high), based on the speed calculated for the network-speed dataset described above. Specifically, the calculated speed is divided by the free-flow speed of the road segment and, depending on a threshold, the qualitative value is produced. It is important to note that OpenStreetMap (OSM) road segments are used to identify the segments: the Link_id contained in these datasets corresponds to the wayId used in the OSM architecture.
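The filtering, per-segment averaging and congestion classification steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: map matching [4] is assumed to have been done upstream, the plain arithmetic mean stands in for the unspecified statistical analysis, and the speed cut-off, the ratio thresholds (0.5 and 0.75), the free-flow speeds and the sample records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

MAX_PLAUSIBLE_KMH = 150.0  # assumed cut-off for "extraordinary" speeds


def link_speeds(records):
    """Average speed and sample size per (link_id, direction).

    Each record is (link_id, direction, speed_kmh) and is assumed to be
    already map-matched to an OSM segment.
    """
    by_link = defaultdict(list)
    for link_id, direction, speed in records:
        if 0.0 <= speed <= MAX_PLAUSIBLE_KMH:  # drop implausible entries
            by_link[(link_id, direction)].append(speed)
    return {key: (mean(values), len(values)) for key, values in by_link.items()}


def congestion_level(speed, free_flow, medium_below=0.75, high_below=0.5):
    """Classify congestion from the ratio of measured to free-flow speed."""
    ratio = speed / free_flow
    if ratio < high_below:
        return "high"
    if ratio < medium_below:
        return "medium"
    return "low"


# Made-up records; the 400 km/h entry is removed by the filter.
records = [(1001, 1, 15.0), (1001, 1, 25.0), (1001, 1, 400.0), (1002, 2, 45.0)]
link_avg = link_speeds(records)
free_flow = {(1001, 1): 50.0, (1002, 2): 50.0}  # hypothetical free-flow speeds
levels = {key: congestion_level(avg, free_flow[key]) for key, (avg, _n) in link_avg.items()}
```

Returning the sample size next to the average mirrors the uniqueEntries field of the network-speed dataset, so that sparsely observed segments can be treated with caution.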

Travel Times
As with the network status datasets, the travel time datasets are calculated by feeding the FCD and the Bluetooth detections (as well as a combination of both) through in-house developed algorithms [5] that output the duration in seconds for predefined paths in Thessaloniki. These paths are stored in the static dataset itravel-paths. The origins and destinations of these paths are defined by the locations of the Bluetooth sensors, which are stored in the static itravel-devices dataset. The datasets itravel-traveltimes and fcd-traveltimes use as input the Bluetooth detections and the FCD respectively in order to calculate the travel time. Furthermore, the outputs of both methodologies are combined in an effort to increase the accuracy and credibility of the overall system's output (datafusion-traveltimes). All three datasets are updated every 15 min.
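The in-house algorithms [5] are not described in this section. As a generic illustration of how travel times can be derived from Bluetooth detections, the sketch below matches the same anonymized identifier at a path's origin and destination detectors and takes the median of the resulting durations. The matching rule, the one-hour upper bound, the use of the median and the sample data are assumptions, not the authors' method.

```python
from datetime import datetime
from statistics import median


def path_travel_times(origin_hits, destination_hits, max_s=3600):
    """Durations in seconds for ids seen at the origin and later at the destination.

    Both arguments map an anonymized id to its detection time at that detector.
    Matches that are non-positive or longer than max_s are discarded.
    """
    durations = []
    for anon_id, t_origin in origin_hits.items():
        t_dest = destination_hits.get(anon_id)
        if t_dest is None:
            continue  # device never reached the destination detector
        seconds = (t_dest - t_origin).total_seconds()
        if 0 < seconds <= max_s:
            durations.append(seconds)
    return durations


# Made-up detections at the two ends of one path.
origin = {
    "a1": datetime(2021, 3, 1, 8, 0, 0),
    "b2": datetime(2021, 3, 1, 8, 1, 0),
    "c3": datetime(2021, 3, 1, 8, 2, 0),
}
dest = {
    "a1": datetime(2021, 3, 1, 8, 4, 0),
    "b2": datetime(2021, 3, 1, 8, 6, 30),
    "d4": datetime(2021, 3, 1, 8, 7, 0),
}
durations = path_travel_times(origin, dest)
estimate = median(durations)
```

The median is used here because a few devices that stop along the way would otherwise inflate a mean-based estimate.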

Ethics Statements
All data described in this paper strictly follow the GDPR guidelines for personal data. No publicly available data can be traced back to an individual. This submission follows the ethical requirements for publication in Data in Brief.