Data-Driven Prediction System of Dynamic People-Flow in Large Urban Network Using Cellular Probe Data

1 Ph.D., Data Scientist, Ford Motor Company, 22000 Michigan Ave, Dearborn, MI 48124, USA
2 Ph.D., Data Warehouse Engineer, GlobalFoundries, 400 Stone Break Rd Extension, Malta, NY 12020, USA
3 Ph.D., Research Associate, TOPS Laboratory, University of Wisconsin-Madison, 1415 Engineering Drive, Room 1217, Madison, WI 53706, USA
4 Ph.D., Machine Learning Researcher, BMW Technology Inc., 540 W Madison St Suite 2400, Chicago, IL 60661, USA
5 Traffic Engineering Consultant, TranSmart Technologies Inc., 411 S Wells St, Chicago, IL 60607, USA
6 Ph.D., Research Associate, TOPS Laboratory, University of Wisconsin-Madison, 1415 Engineering Drive, Room 1249A, Madison, WI 53706, USA
7 Ph.D., Vilas Distinguished Achievement Professor, TOPS Laboratory, Department of Civil and Environmental Engineering, University of Wisconsin-Madison, USA


Introduction
Dynamic people-flow in this paper refers to the estimated number of people moving into or out of a zone, which reflects the real-time travel demand. As urbanization increases, people-flow monitoring data has become an essential source of information for decision-making in urban planning, urban disaster and emergency management, and urban roadway operations. More broadly, dynamic people-flow can provide decision-making insights for other industries, for example targeting a specific audience for advertisements or selecting an optimal store location.
Traditionally, urban people-flow is estimated using a four-step method based on survey data, which is both labor and capital intensive and is updated infrequently. Some studies processed video data collected from single or multiple closed-circuit television (CCTV) cameras, which provide high-accuracy people-flows in real time. However, this method is impractical for a large network because surveillance cameras are sparsely installed. Other studies use passive data collection methods based on GPS, Bluetooth, or social media networks. However, these suffer from limited sample sizes and bias in user attributes.
Cellular network operators collect cellular probe data of mobile phone users daily. In recent years, thanks to the rapid development of cellular communication technology, most people in developed and developing countries own mobile phones. For instance, as of June 2016, 77.3% of the total population in China owned at least one mobile phone [1]. The anonymous mobile phone traces record the location of mobile phone users when they text, call, connect to the Internet, and even passively when the mobile phone communicates with the cellular network. This provides an opportunity to study and monitor human activity using cellular data, but there are still some issues with deriving people-flow from cellular data.
The first issue is that the update frequency of the mobile phone user's location relies on the mobile phone activity frequency, which is not uniformly distributed temporally or spatially. Temporally, people usually use their mobile phones more frequently during the day than during the night. Spatially, some people use their mobile phones more frequently near work while others use them more frequently at home. Therefore, it is hard to use a statistical method to estimate the real-time people-flow based on the people movement detected by cellular probe data. The second issue is the efficiency of data processing and model calibration, since the cellular probe dataset is extremely large.
To address the issues above, a machine-learning based data-driven system is designed to predict the grid-based inbound/outbound people-flow. The study area is divided into square grids, which integrate multiple data sources as the input features for the machine-learning model. The inbound/outbound flow of each grid is estimated with real-time cellular data aggregated into 5-minute increments as the real-time people movement feature. To calibrate the model, individual trajectories were inferred by a trip-chain model and integrated into the 5-minute people-flow for each grid. The random forest method is used in the data-driven system because of its performance in processing large datasets. The proposed data-driven system predicts the inbound/outbound people-flow of each grid 30 minutes into the future.
The rest of this paper is organized as follows. Section 2 reviews the current state of analysis for people-flow estimation, the existing studies on cellular probe modeling, and recent studies on data-driven methods using passively collected data. Section 3 presents the methodology for this paper, including the grid-based data integration model, the trip-chain based individual trajectory inferring model, and the machine-learning based data-driven model. Section 4 presents a case study on a real network in a large-scale urban area. Section 5 summarizes this paper and discusses the result.

People-Flow Estimation.
Conventional methods for people-flow estimation are usually derived from data collected by surveys, roadside detectors, surveillance video, and other passive data collection methods. The conventional travel demand between each pair of traffic analysis zones is inferred from the city Origin-Destination (OD) matrices, which are estimated from a citywide survey. The survey is usually expensive and updated only once every five years. A classic four-step regional forecasting model based on survey data is able to estimate and predict the people-flow at a large scale [2,3]. Besides the survey data, recent studies showed that several other methods are capable of deriving OD matrices from emerging passive data sources, such as traffic count data, vehicle plate matching data, GPS data, and social media data [4][5][6][7][8][9]. The traffic count data and vehicle plate matching data rely on data collection infrastructure, which is costly, requires maintenance, and is usually specific to freeway networks. The GPS-based OD derivation method has lower cost and higher accuracy, but suffers from issues including limited sample size and coverage area, sampling bias, and privacy concerns, which is why it is not widely used for OD estimation [10,11]. Social media services as a data source for human activity studies also suffer from sample size and sample bias issues [12]. In a study of origin-destination demand in a large-scale network in Korea, the real-time OD demand is estimated and predicted with a data-driven method using real-time demand data. Three strategies of implementing the features for the k-nearest neighbor algorithm are compared and presented [13]. Cellular data is also widely used in the fields of trip distribution estimation, traffic state estimation, and traffic flow monitoring in freeway networks [14][15][16][17].

Data-Driven Approaches.
In the last 20 years, the data-driven approach has been applied to the field of intelligent transportation systems (ITS) and has improved the efficiency and performance of ITS [18]. The data-driven approach refers to algorithms that are driven by data rather than by a predefined model. It lets the data determine how the problem is solved, while the traditional methods depend on human experience and historical data. Taking advantage of the widely deployed ITS sensors and multiple real-time data sources for individuals, vehicles, and roadway networks, a real-time data-driven ITS would improve the accuracy and efficiency of conventional ITS [19]. The method has been widely used in many areas of current ITS. Some studies address short-term travel time prediction on freeway networks using speed and traffic count data [20,21]. Data-driven dynamic simulation approaches have been studied using real-time traffic data to estimate roadway traffic volumes across various time intervals [22]. A dynamic data-driven approach has been applied to the surface transportation system [23]. Benefiting from increasing data volumes and computing power, the data-driven approach has been widely applied to transportation systems. In summary, the studies applying the data-driven approach in the field of transportation have focused on travel time prediction, traffic state monitoring, and travel demand estimation. Few studies have investigated the feasibility of applying the data-driven approach to people-flow prediction over a large-scale area for real-time service using cellular signaling data.

Dynamic People-Flow Prediction Framework
3.1. System Architecture. This paper describes a data-driven online people-flow prediction system, as shown in Figure 1. The system contains four modules.
Module a: Cellular Probe Data Preprocessing Module. This module processes the real-time cellular data and stores the preprocessed cellular data.
Module b: Grid-Based Data Transformation and Integration Module. This module integrates the multiple data sources as the attributes of each grid. Input features are generated to train the machine-learning model in module d.
Module c: Trip-Chain Based Human-Daily-Trajectory Inferring Module. This module provides the daily trajectories of each mobile subscriber. By integrating the trajectories, the people-flow (inbound/outbound) of the grids can be estimated as the labels for the machine-learning model in module d.
Module d: Machine-Learning Based Online People-Flow Prediction Module. This module uses a random forest model for offline learning, with the input features from module b and the input labels from module c. Real-time cellular data is the input of the online prediction model.

Cellular Probe Data Preprocessing Module
3.2.1. Cellular Probe Data. Cellular network operators collect the location of cellular network subscribers for billing and operational purposes. The location is not a highly accurate user location but a virtual location represented by the base station (BS) to which the user is connected. Each BS has a corresponding coordinate and a unique combination of the cell identification code (CI) of the BS and the location area code (LAC) of the connected location area. The cellular data is stored in the database by the mobile switching center (MSC). Each row of cellular data includes LAC, CI, timestamp, and event type. The GSM network signaling data of mobile subscribers is stored in a two-level hierarchical database consisting of the home location register (HLR) and the visitor location register (VLR).
Because of the nature of cellular phone communication, preprocessing algorithms should be applied to generate more accurate data with fewer redundancies. Because user location is based on the location of cellular records, the update frequency and temporal coverage of the cellular data are critical in this study. Based on sample data of Shanghai from one of the major cellular carriers of China, Figure 2(a) shows that more than 75% of users have 20 or more records per day. Figure 2(b) shows the time of each user's first and last record. Most of the users have their first record earlier than 6 AM and their last record later than 10 PM.
The events in cellular probe data include two basic types: location update (LU) and handover (HO). A location update can be triggered under the following conditions: the mobile phone is switched on, the mobile phone moves from one location area to another, or a periodic location update occurs, generally once per hour. Handover (HO) is triggered when the mobile phone is in communication status and travels from one BS to an adjacent BS. Both BSs are recorded in the cellular signaling database when HO is triggered. Thus, when a mobile phone user makes a phone call during a trip through several base stations, a series of timestamped records with estimated locations will be tracked. Each mobile phone user has a unique mobile station ID (MSID).

Cellular Probe Preprocessing.
Based on the attributes of cellular phone data, the raw data indicates whether the mobile phone is moving or stationary. In the data-driven system, the quality of the input cellular probe raw data is critical. Three types of errors, defined below, are processed in this module.
Definition 1 (duplicated data). Duplicated data is three or more consecutive pieces of raw data with the same MSID, Cell ID, and LAC. The processing procedure is shown in Box 1.
Definition 2 (Ping-Pong switching). When the mobile phone moves to the edge of the cellular coverage area, the connection to the current BS becomes weak as the signal from the adjacent BS grows stronger. In this case, the mobile phone will terminate the connection to the current BS and connect to the new BS. However, the signal attenuation does not change linearly with the distance between the BS and the cellphone. So, at the boundary between two cellular coverage areas, the mobile phone may be covered by multiple BSs with similar signal intensities. In this case, the cellular phone may switch the connection between two BSs even if it is stationary. The processing procedure is shown in Box 1.
Definition 3 (drift switching). Occasionally, during the collection of cellular data, the mobile phone can switch to a BS which is very far from the previous BS and then switch to another BS near the first BS. The reasons that drift switching gets triggered are complex and unpredictable. The major reasons are BS signal blocking and an unstable antenna environment. The processing procedure is shown in Box 1.
In Box 1, Δd = the distance between two points, Δt = the time interval between two records, Lat and Lon = the latitude and longitude of row i of MSID u, and R = the Earth's radius. Box 1 shows the three data preprocessing algorithms. The preprocessed data is the input of modules c and d.
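Box 1 is not reproduced in this text, so the following is only a minimal sketch of how the three definitions could be applied to the time-sorted records of one MSID. The record layout `(t_seconds, lac, ci, lat, lon)`, the 60-second Ping-Pong window, and the 83 m/s speed ceiling are illustrative assumptions, not values from the paper; the distance Δd is computed with the standard haversine formula using the Earth radius R.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius R in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance (delta d) in meters between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cell(rec):
    """Cell identity of a record: the (LAC, CI) pair."""
    return rec[1], rec[2]


def speed_mps(a, b):
    """Implied speed delta d / delta t between two records."""
    dd = haversine_m(a[3], a[4], b[3], b[4])
    dt = b[0] - a[0]
    if dt <= 0:
        return math.inf if dd > 0 else 0.0
    return dd / dt


def drop_duplicates(records):
    """Definition 1: in a run of 3+ records at one cell, keep only first and last."""
    out = []
    for i, rec in enumerate(records):
        same_prev = i > 0 and cell(records[i - 1]) == cell(rec)
        same_next = i + 1 < len(records) and cell(records[i + 1]) == cell(rec)
        if not (same_prev and same_next):  # interior records of a run are dropped
            out.append(rec)
    return out


def drop_ping_pong(records, window_s=60):
    """Definition 2: drop a record that bounces A -> B -> A within window_s (assumed)."""
    out = list(records)
    i = 1
    while i + 1 < len(out):
        prev, cur, nxt = out[i - 1], out[i], out[i + 1]
        if cell(prev) == cell(nxt) != cell(cur) and nxt[0] - prev[0] <= window_s:
            del out[i]
        else:
            i += 1
    return out


def drop_drift(records, max_speed_mps=83.0):
    """Definition 3: drop a record reached and left faster than max_speed_mps (assumed)."""
    out = list(records)
    i = 1
    while i + 1 < len(out):
        if (speed_mps(out[i - 1], out[i]) > max_speed_mps
                and speed_mps(out[i], out[i + 1]) > max_speed_mps):
            del out[i]
        else:
            i += 1
    return out
```

The three filters are independent and could be chained in the order of the definitions; whether the original system applies them in that order is not stated.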

Feature Integration Module.
The data-driven system predicts the fine-grained, grid-based inbound/outbound people-flow for real-time service. The module inputs are multiple grid attributes and the cellular raw data. The module output is the set of generated features for the data-driven model. Grid attributes are used because the population flow pattern of an area is highly related to the attributes of that area. For instance, a grid crossed by a subway line may have a larger people-flow than a vacant area. In this paper, the study area is divided into squares, each of which is referred to as a "grid". The data sources are integrated into the grids in the study area.

Point of Interest (POI) Features ($F_p$).
A point of interest is a specific location that serves a particular purpose, such as a restaurant or a hospital. Each POI has a coordinate, a name, a category, and an address. The POI categories in this study include hotel (1), school (2), government (3), bank (4), hospital (5), market and mall (6), restaurant (7), stadium (8), transportation hub (9), and factory (10). For each grid, the number of POIs is calculated by POI category.
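As a minimal sketch of the per-grid POI count, the code below assigns each POI to a square grid cell and counts POIs by category. It assumes POI coordinates have already been projected to meters relative to the corner of the study area; the function names, the category labels, and the 600 m default cell size (taken from the case study) are illustrative choices.

```python
from collections import Counter, defaultdict

# Category order follows the list in the text; the string labels are illustrative.
POI_CATEGORIES = ["hotel", "school", "government", "bank", "hospital",
                  "market_mall", "restaurant", "stadium", "transport_hub", "factory"]


def grid_index(x_m, y_m, cell_m=600.0):
    """Map projected coordinates in meters to a (row, col) grid cell."""
    return int(y_m // cell_m), int(x_m // cell_m)


def poi_features(pois, cell_m=600.0):
    """Return {grid: [count per category]} from (x_m, y_m, category) tuples."""
    counts = defaultdict(Counter)
    for x, y, category in pois:
        counts[grid_index(x, y, cell_m)][category] += 1
    return {g: [c[cat] for cat in POI_CATEGORIES] for g, c in counts.items()}
```

Grids without any POI are simply absent from the result and can be treated as all-zero vectors.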

Transportation Network Features ($F_r$). The transportation network in this paper refers to the major road network (1), the light rail network (2), and the subway network (3). Since there is a strong correlation between a transportation network and the people-flow, the links of the transportation network are mapped onto each grid.

Temporal Features ($F_t$). Besides the grid-based features, there are some other features that may influence the dynamic changes in people-flow. There are three binary temporal features in this study: peak hour (T1), work time (T2), and night time (T3).
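The paper does not state the hour ranges behind the three binary temporal features, so the sketch below uses placeholder ranges purely for illustration (peak 7 to 9 and 17 to 19, work 9 to 17, night 22 to 6). All three ranges are assumptions and would need to be replaced with the values used in the study.

```python
def temporal_features(hour, peak=((7, 9), (17, 19)), work=(9, 17), night=(22, 6)):
    """Return (T1, T2, T3) binary flags for an hour of day in [0, 24).

    The default hour ranges are hypothetical placeholders, not from the paper.
    """
    t1 = int(any(lo <= hour < hi for lo, hi in peak))   # peak hour
    t2 = int(work[0] <= hour < work[1])                 # work time
    t3 = int(hour >= night[0] or hour < night[1])       # night time wraps past midnight
    return t1, t2, t3
```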

People Movement Level Collection from Real-Time Cellular Data. The real-time people movement in this study refers to the sequence of mobile phone user locations inferred from the cellular raw data in a 5-minute time interval. There are two major cellular signal transition events, (1) location update and (2) handover, and both should be correlated with the grids. Due to the nature of cellular data, the handover event can locate the mobile phone more accurately than the location update.
Location Update. Based on the attributes of the raw cellular data, the coverage area of each cellular tower can be calculated with a Voronoi diagram. Spatial join analysis is used to calculate the percentage of each Voronoi cell that maps onto each grid.
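The Voronoi area of each BS is defined as

$$B_k = \{\, x \in X \mid d(x, P_k) \le d(x, P_j) \ \text{for all } j \ne k \,\}$$

where X = the study space, that is, the study area. The result of the Voronoi partition in a study area is shown in Figure 3(a). Each dot is the location of a BS. Each polygon around a dot is the cellular coverage area of that BS.
The contribution rate of each BS to the grid is shown in Figure 3(b).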
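The paper computes the overlap between Voronoi cells and grids with a spatial join. As a rough, dependency-free approximation of the same idea, the sketch below samples an n by n lattice of points inside one grid cell, assigns each sample to its nearest BS (which is exactly the Voronoi membership rule), and reports the share of the grid area falling in each BS coverage area. The function names, planar coordinates in meters, and the sampling resolution are assumptions for illustration.

```python
import math


def nearest_bs(x, y, stations):
    """Return the id of the BS whose site is closest to (x, y)."""
    return min(stations, key=lambda s: math.hypot(x - stations[s][0], y - stations[s][1]))


def contribution_rates(grid_origin, cell_m, stations, n=20):
    """Approximate the share of one grid cell covered by each BS Voronoi area.

    grid_origin: (x0, y0) lower-left corner of the grid cell in meters.
    stations: {bs_id: (x_m, y_m)}.
    """
    x0, y0 = grid_origin
    step = cell_m / n
    hits = {}
    for i in range(n):
        for j in range(n):
            # Sample the midpoint of each sub-square of the grid cell.
            bs = nearest_bs(x0 + (i + 0.5) * step, y0 + (j + 0.5) * step, stations)
            hits[bs] = hits.get(bs, 0) + 1
    return {bs: count / (n * n) for bs, count in hits.items()}
```

An exact polygon intersection would replace the sampling step in a production pipeline; the sampled version only converges to it as n grows.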

Daily Individual Traveler Trajectory Estimation Module.
This study uses a random forest (RF) model as the data-driven model. The features acquired in the previous section form the training inputs. In this section, the validation data set is calculated from the daily cellular data. The proposed transportation mode share model is the combination of a trip-chain based microscopic mode choice model and a mode share aggregation process. The output of the mode choice model at the individual level is the mode choice decision of a mobile user for every trip within one day. The trip-chain based rules reflect the temporal-spatial and private vehicle usage constraints within one day. The mode choice results of the individual mobile phone users are then aggregated with the user characteristics to obtain the transportation mode shares at the macroscopic level. To do this, the daily individual trajectory must first be inferred.

Inferring Individual Stays and Travels.
A rule-based model is used for home location detection and activity inference. Figure 4 shows the stays and trips.
Home Location Detection. Mobile phone users are classified into daytime-active users and nighttime-active users, and the home location detection process is applied to each group separately. If the user stays in a zone between 12:00 AM and 8:00 AM on sequential days, the user is classified as a daytime-active user. Otherwise, he/she is classified as a nighttime-active user. The home location detection rules are then set as follows: for a daytime-active user, the most frequently pinned station between 12:00 AM and 8:00 AM is set as the representative home location of the user; for a nighttime-active user, the most frequently pinned station between 8:00 AM and 12:00 AM is set as the representative home location of the user.
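The home location rule can be sketched as a frequency count over the relevant time window. The example below assumes the user has already been classified as daytime-active or nighttime-active, and that records are `(datetime, station_id)` pairs; both the record layout and the function name are illustrative.

```python
from collections import Counter
from datetime import datetime


def home_station(records, daytime_active=True):
    """Return the most frequently pinned station in the user's home window.

    Daytime-active users: window is 12:00 AM to 8:00 AM.
    Nighttime-active users: window is 8:00 AM to 12:00 AM.
    Returns None when no record falls inside the window.
    """
    def in_window(ts):
        before_8am = ts.hour < 8
        return before_8am if daytime_active else not before_8am

    counts = Counter(station for ts, station in records if in_window(ts))
    return counts.most_common(1)[0][0] if counts else None
```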
Activity Inference. After the home location is obtained, the activities of the mobile phone user are extracted by inferring the Potential Stays. The location update data and phone bill data are both included in the following inference process. A Potential Stay point is identified by a sequence of consecutive mobile phone records bounded by both spatial and temporal constraints, as shown in Figure 4. The spatial constraint is the roaming distance between the first and the last record in a stay location. The roaming distance should be related to the distance between base transceiver stations in that area. For example, the roaming distance in the Shanghai central area is set to 520 meters, considering that the average distance between neighboring base transceiver stations is 260 meters in that area. The temporal constraint is the required minimum duration in a stay location. In this study, only a sequence of mobile pin records that satisfies the spatial constraint and lasts longer than 30 minutes qualifies as a potential stay.
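One possible reading of the two constraints is the greedy scan sketched below: starting from a record, later records are added while they remain within the roaming distance of the first record, and the group is kept as a potential stay if it lasts longer than the minimum duration. The record layout `(t_minutes, x_m, y_m)` with planar coordinates in meters and the choice of the mean position as the stay location are assumptions; the 520 m and 30-minute defaults come from the text.

```python
import math


def potential_stays(records, roam_m=520.0, min_minutes=30.0):
    """Detect potential stays in time-sorted (t_minutes, x_m, y_m) records.

    Returns a list of (t_start, t_end, mean_x, mean_y).
    """
    stays, i, n = [], 0, len(records)
    while i < n:
        j = i
        # Extend while the next record stays within roam_m of the first record.
        while j + 1 < n and math.dist(records[i][1:], records[j + 1][1:]) <= roam_m:
            j += 1
        if records[j][0] - records[i][0] > min_minutes:
            group = records[i:j + 1]
            mean_x = sum(r[1] for r in group) / len(group)
            mean_y = sum(r[2] for r in group) / len(group)
            stays.append((records[i][0], records[j][0], mean_x, mean_y))
            i = j + 1  # continue after the stay
        else:
            i += 1
    return stays
```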
The activities of the mobile phone user are correspondingly extracted from the final stay detection results. The stay location and stay duration are imported as the features of the day. If a daytime-active user stays in the same location for more than seven hours between 8:00 AM and 6:00 PM in a day, the stay location is marked as the work location, and the related activity is marked as a work activity. Land use data can help with inferring the activity purpose.
Travel Detection. After the stay points are detected, the connection between two stay points is a travel of the mobile phone user. The combination of the phone bill data and the location update data can record the movement trace between the origin and destination. There are two situations: (1) when the origin and destination locations are different, the travel is the connection between the two activities; (2) when the origin and destination are the same location, composing a trip-chain, the furthest pinned position is set as the Stop-By point for the trip. In the second situation, there are two travels for this connection: one from the origin to the Stop-By point, and the other from the Stop-By point to the destination. The travel distance and travel time are recorded based on the broken line connecting the sequential pinned positions.

Extracting Trip-Chains.
The trip-chains of mobile phone users are composed from the stays and travels inferred in the previous section. Following activity and trip-chain theory, the typical trip-chain modes of the travelers are presented in Figure 5. The home-based trip-chains of mobile phone users within one day can be classified as: simple home-based chain, complex chain, chain containing a subchain, and multihome-based chain. In Figure 5(d), for example, trips 1 and 4 are the main chain and trips 2 and 3 are the subchain. If the travel of a user cannot compose a trip-chain, the travel is treated as a separate trip in the later mode choice step.

Travel Mode Detection and Trajectory Map Matching.
In this study, the travel mode detection of the mobile phone user is performed at the trip-chain level. A mobile phone user's travels with a private vehicle and with public transportation are significantly different. A nonprivate vehicle user can change between nonprivate travel modes freely. Considering the rapidly growing usage of private cars in developing countries, the accuracy of the mode choice for the first home-based trip is critical for step 3. Two assumptions are made in this step.
Subway Trips. The subway mobile stations have been labeled as "subway station" or "underground lane" in the GSM network. Because each of the subway lines has a unique location update code, the mobile phone will connect to the LAC of the current subway line. For a mobile phone trajectory T carrying the subway mark, the nearest path can be generated. From point to point, the subway link on the correct subway line should be selected. If multiple subway lines are candidates, the nearest link should be selected as the starting point. From the subway network, the traveled trajectories can be inferred from the starting link and ending link.
Highway Trips. The individual trajectory is mapped onto the highways. If the trip-chain based travel mode selection flags a trip as a highway trip, the Dijkstra algorithm is used to find the best highway-based route. The set of possible routes is restricted to a corridor that estimates the area within which the mobile phone subscriber would be able to travel. The road network comes from a shape-file map of the study area, which contains the roadway links and edge points.
Nonhighway Trips. These are trips that are not on the freeway or that use a nonmotorized travel mode. In this case, the trajectory is treated as a straight line. The starting point and the ending point of the straight line are the locations of the connected BSs.
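For the highway route search, a textbook Dijkstra implementation over an adjacency list is sketched below. The graph layout (`{node: [(neighbor, cost), ...]}`) and the use of link length or travel time as cost are assumptions; restricting the search to a corridor would amount to passing in only the links that fall inside that corridor.

```python
import heapq


def dijkstra(graph, source, target):
    """Shortest path on a weighted directed graph.

    graph: {node: [(neighbor, cost), ...]} with non-negative costs.
    Returns (total_cost, path) or (inf, []) when target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```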

Figure 6: The workflow of the data-driven process. The output is the 5-minute grid people-flow at (t + 30 min).

Data-Driven Process.
The two parts of the data-driven process are shown in Figure 6. A classification random forest (RF) machine-learning model is used as the data-driven model in this paper. The first part is the offline learning, which calibrates the RF model using the historical data. The second part is the online inference, which applies the calibrated model to the real-time cellular probe data to predict the grid-based people-flow.

Random Forest Model.
The RF model is the major data-driven model used in this paper. The RF algorithm was first proposed by Breiman in 2001. It is an ensemble method: a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [24].
In this study, the random forest approach is the major classification method. Each tree is trained on a training data set with the selected features and the related labels. In the tree building procedure, attributes are split at each node from the top level to the leaf level. An entropy index is used to determine the best feature at each node, as in Equation (5):

$$E = -\sum_{i=1}^{n} p_i \log_2 p_i \quad (5)$$

where E is the entropy of each feature, n is the number of values in each of the features, and $p_i$ is the proportion of class i.
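As a small illustration of the entropy index, the sketch below computes the Shannon entropy of a list of class labels and the weighted entropy after splitting on one numeric feature at a threshold. A lower weighted entropy indicates a better split. The function names and the binary threshold split are illustrative; the paper does not specify the split search.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def split_entropy(rows, labels, feature_idx, threshold):
    """Weighted entropy of the two children after splitting on one feature."""
    left = [y for x, y in zip(rows, labels) if x[feature_idx] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_idx] > threshold]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)
```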
The results of the individual tree classifiers are collected for voting, and the most popular result is the RF output. The randomization approach has two parts: bagging and random feature selection. In the first part of the RF method, a bootstrapping process, in which the training data is selected randomly for each tree, is used for tree generation. The features used to train each tree are also selected randomly.
In the second part of the RF method, the forest is built up by assembling the trees. The importance of each feature is measured from the total data set. Then, the permuted data set is used in model development and refinement. The mean decrease accuracy index (MDAI) is calculated for each of the features. The importance of each feature is calculated as in Equation (6):

$$VI(x_j) = E_{\pi_j} - E \quad (6)$$

where $VI(x_j)$ is the importance of attribute $x_j$, $E$ is the error rate before the permutation process, and $E_{\pi_j}$ is the error rate after the permutation process for feature j.
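The permutation idea behind the mean decrease accuracy index can be sketched independently of any particular RF library: measure the error rate of a fitted model, shuffle one feature column, and measure the error again. The `predict` callable, the list-of-lists row layout, and the fixed random seed are assumptions for illustration; in a random forest the evaluation rows would normally be the out-of-bag samples of each tree.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows for which predict(row) differs from the label."""
    return sum(predict(r) != y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, j, seed=0):
    """Increase in error rate after shuffling feature column j."""
    rng = random.Random(seed)
    base = error_rate(predict, rows, labels)
    column = [r[j] for r in rows]
    rng.shuffle(column)
    permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
    return error_rate(predict, permuted, labels) - base
```

A feature that the model ignores gets an importance of exactly zero, because shuffling it cannot change any prediction.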
Because of the nature of each zone, the maximum people-flow differs between grids. For example, the people-flow of a grid in a transit area may reach 20,000 people per hour, while in some community areas the people-flow will be less than 100 people per hour. Since the RF model is calculated based on the weights of each feature, an extremely wide data distribution will affect the accuracy of the result. For this reason, both the people-flow estimated in 5-minute time intervals and the people-flow estimated from daily trajectories are divided into 6 levels.
The level interval of each grid is the regular maximum flow of that grid divided by 5. If the input estimated flow is larger than the regular maximum flow, the flow is assigned to class 6. The number of people is then estimated by multiplying the predicted level by the level interval of the grid.
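The level scheme can be sketched as follows. The sketch assumes levels 1 to 5 cover equal-width intervals up to the grid's regular maximum flow, with each interval closed on its upper end, and that level 6 is reserved for flows above the maximum; the exact boundary convention is not given in the paper and is an assumption here.

```python
import math


def flow_to_level(flow, regular_max):
    """Map a people-flow count to a level in 1..6 for one grid."""
    if flow > regular_max:
        return 6  # above the regular maximum of this grid
    interval = regular_max / 5
    return max(1, math.ceil(flow / interval))


def level_to_flow(level, regular_max):
    """Convert a predicted level back to an estimated people-flow."""
    return level * regular_max / 5
```

For a grid with a regular maximum of 1,000 people, the level interval is 200, so a predicted level of 3 corresponds to an estimated flow of 600 people.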
The people-flow within an area may change with the date type. For example, the central business district attracts less people-flow during a holiday than during a work day. For better accuracy, the model is calibrated by holiday type. On the first day and the last day of a holiday, the people-flow may be distributed differently.
The model is calibrated based on the date type, as shown in Equation (7).
Here the date type contains 5 key features of the date, including whether the day is a holiday, the position of the date within the holiday, and the weekday.
In the online part, real-time cellular data is aggregated into 5-minute time intervals. The appropriate model is selected based on the date of the cellular data. The people movement is mapped onto each grid. The online inference model then takes the cellular probe data as input and outputs the predicted people-flow.
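The 5-minute aggregation step of the online part can be sketched as counting distinct phones per grid and time bin. The event layout `(t_seconds, msid, grid)` and the choice to count each MSID once per bin and grid are assumptions made for illustration.

```python
from collections import defaultdict


def movement_feature(events, bin_s=300):
    """Count distinct MSIDs per (time bin, grid) from (t_seconds, msid, grid) events."""
    seen = defaultdict(set)
    for t, msid, grid in events:
        seen[(int(t // bin_s), grid)].add(msid)
    return {key: len(msids) for key, msids in seen.items()}
```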

Case Study
4.1. Data Source. The study area covers 1,000 km², with more than 20 million people in the coverage area. The area was divided into 10,000 grids (100 × 100 grids). Each grid is a square with a 600-meter side length. There are four data sources available in the study area.
(i) Cellular raw data: the data was collected by one of the top three major cellular carriers in Shanghai from November 27th to November 29th, 2013 (Wednesday to Friday). A total of 3.1 million mobile phone users were extracted to test the proposed system and validate the models. The three days contain 6.7 billion pieces of cellular data.
(ii) Transportation network data: the dataset includes network links of the subway network and major highways.
(iii) POI data: the POI dataset in the study area was collected in 2015. There are 3,696 hospitals, 6,395 schools, 4,436 hotels, 3,499 government agencies, 34,495 markets, and 21,928 restaurants in the study area.

Prediction Results
Feature Evaluation and Selection. Feature selection is a critical process in machine-learning modeling, which selects a subset of the relevant features as the input for modeling. Four features and six feature combinations are evaluated in this section. Because the model is applied in real time, the real-time people movement feature and the temporal feature ($F_t$) should be primary features. Table 1 shows the combinations of the primary features and two secondary features: the transportation network feature ($F_r$) and the POI feature ($F_p$). The RF results from the six feature-combination scenarios are listed in Table 1. Based on the results, as more features are fed into the RF model, both precision and recall improve. Recall improves less because the flow categories are divided into equal intervals, so the number of records in each category is not uniformly distributed.
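For reference, per-class precision and recall for a multi-level classifier can be computed as in the sketch below. How the paper averages these values across the six flow levels is not stated, so the sketch stops at the per-class values; the function name is illustrative.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall of one class in a multi-class prediction."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```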

Conclusion and Discussion
Zonal inbound and outbound people-flow is a major output of travel demand modeling. It is a critical data source for transportation planning, operations, and management and is usually estimated from travel surveys and GPS data. Travel surveys take tremendous labor and capital resources, so they are usually conducted only every 3-5 years. Additionally, GPS data, including cellphone GPS or vehicle GPS, usually has a low sample size, which makes it hard to reflect the people-flow of a whole population. Cellular signaling data, which can be passively collected in real time at low cost with a high sampling rate, has great potential to overcome the weaknesses of GPS data and survey data. However, because cellular data is temporally and spatially sparse, few previous studies focused on extracting the real-time people-flow using cellphone signaling data.
This study presents a data-driven people-flow prediction system. The benefit of the proposed prediction system is an efficient and accurate real-time people-flow prediction service. Since the cellular signaling data in a large-scale network is extremely large, the calibration efficiency for the real-time service is critical. The proposed trip-chain model makes it possible to identify missing trips and to calibrate the people-flow in real time. A grid-based data integration module is used for data integration and feature extraction. Multiple data sources, including POI features, temporal features, real-time people movement level features, and transportation network features, are integrated into a grid-level system. In this way, the model calibration process is efficient because the calibrated model can be applied to all grids with different attributes.
The online inference RF model with four types of features provides a precision of 76.8% and 70% for outbound and inbound people-flow, respectively, which are much higher than the results of a single-feature prediction model. Hence, the data-driven approach in this paper, which uses an offline training model and an online inference model, is able to predict the people-flow in a real-time, efficient, and accurate way.

Figure 1: System architecture of people-flow prediction system.

Figure 2: Preliminary of cellular data temporal coverage.
In the Voronoi definition, d = the distance function, $B_k$ = Voronoi area k, and $P_k$ = the set associated with $B_k$.

In Figure 3(b), the hexagon represents the coverage area of a BS. For grid A, the contribution rate is a%, which is calculated in Equation (4):

$$\text{coverage percentage ratio} = \frac{\text{BS coverage area}}{\text{grid area}} \quad (4)$$

Handover. The handover location should be the middle point between each pair of overlapping BS coverage areas. The coordinate of the middle point at the boundary of the cell coverage areas is calculated as the handover point, which is shown in Figure 3(b). Each hexagon represents the estimated coverage area of a cell tower. Each dot in Figure 3(b) is the calculated HO location in the overlapping area. The calculated individual movements are aggregated in a 5-minute time interval for each cellular phone as the people movement feature.

Figure 7: Comparison of the base people-flow data and the predicted result.

Table 1: RF prediction result evaluation.