Forecast and Early Warning of Regional Bus Passenger Flow Based on Machine Learning

This paper mainly forecasts the short-term passenger flow of regional bus stations based on the integrated circuit (IC) card data of bus stations and puts forward an early warning model for regional bus passenger flow. Firstly, the bus stations were aggregated into virtual regional bus stations. Then, the short-term passenger flow of regional bus stations was predicted by the machine learning (ML) method of support vector machine (SVM). On this basis, the early warning model for regional bus passenger flow was developed through the capacity analysis of regional bus stations. The results show that the prediction accuracy of short-term passenger flow could be improved by replacing actual bus stations with virtual regional bus stations because the passenger flow of regional bus stations is more stable than that of a single bus station. The accurate prediction and early warning of regional bus passenger flow enable urban bus dispatchers to maintain effective control of urban public transport, especially during special and large-scale activities.


Introduction
Urban public transport is a traffic mode that alleviates traffic congestion and makes efficient use of road resources. To realize intelligent dispatching of buses, decision makers need to understand how bus passenger flow changes and to predict the passenger flow accurately in the short term. Burst passenger flows generate a huge traffic demand in a short time, which may put great pressure on public security. Early warning of passenger flow gives the administration time to prepare. However, urban bus stops currently lack real-time early warning tools with good accuracy. The short-term bus passenger flow is affected by various random, complex, and space-varying factors. In areas with large passenger flow, it is difficult to forecast the short-term change of bus passenger flow. Currently, short-term traffic is generally predicted by traditional statistical methods, novel intelligent methods, and hybrid methods.
Originating from time series analysis in the 1980s, the traditional statistical methods are built on the data collected by manual surveys. Over the years, these methods have been evolving and intellectualized in the context of traffic flow prediction. In the forecast of short-term passenger flow at a bus station, the entry and exit volumes mainly come from the card swiping records of the automatic fare collection (AFC) system. Taking such continuous data as a time series, many models have been designed for traffic flow prediction based on statistical principles, including autoregressive (AR) model, moving average (MA) model, autoregressive integrated moving average (ARIMA) model, and seasonal ARIMA (SARIMA) model [1][2][3][4]. In addition, many have introduced k-nearest neighbors (k-NN) [5], nonparametric regression [6,7], and Kalman filter [8] to predict short-term traffic flow.
Recent years have witnessed the emergence of many hybrid forecast methods, most of which are combinations of novel intelligent methods such as neural networks (NN) and ML. For instance, Ke et al. [25] fused the FCL Net into a new deep learning (DL) method for the projection of short-term passenger demand. Xiao et al. [28] developed a new hybrid forecast strategy for air transport demand, which couples singular spectrum analysis (SSA), adaptive network fuzzy inference system (ANFIS), and optimized particle swarm optimization (OPSO). Sun et al. [29] combined wavelet transform (WT) with SVM into a hybrid prediction model for passenger flow; the model decomposes, predicts, and reconstructs the data on passenger flow in three stages and inherits the merits of both WT and SVM. Tan et al. [30] put forward a total traffic flow prediction method based on NN, MA, exponential smoothing (ES), and ARIMA.
In addition, Hinton and Salakhutdinov [31] applied DL to solve short-term prediction. Hu et al. [32] optimized the parameters of support vector regression (SVR) through particle swarm optimization (PSO), introduced historical momentum to reduce the impact of noise in traffic flow data, and then established a PSO-SVR model for the forecast of short-term traffic flow. Dogan [33] designed the periodic clustering and prediction (PCP) algorithm, adopted the algorithm to improve the training set of an artificial neural network (ANN), and proved that the improved ANN could predict short-term traffic flow based on selected clusters. Bagloee et al. [34] proposed a hybrid ML-based method to solve the bilevel optimization problem. Han et al. [35] derived a hybrid, optimized LSTM from Nesterov accelerated adaptive moment estimation (Nadam) and stochastic gradient descent (SGD).
To sum up, the traditional statistical methods mostly treat the current traffic state as a linear combination of the previous states and errors. On the upside, the traffic flow can be predicted simply by mathematical statistics, with relaxed data requirements. On the downside, the traditional methods fail to reflect the randomness and nonlinearity of traffic flow, consume too much manpower and financial resources in data acquisition, and have a low accuracy in the prediction of traffic flows, especially those with sudden changes. The novel intelligent methods are better than the traditional methods in data fitting and prediction accuracy, but the computing process is much more complex. The hybrid methods generally consider the features of actual traffic flow, and their prediction accuracy varies with the coupled algorithms. Compared with the intelligent methods, the hybrid methods are highly complicated. The advantages, disadvantages, and applicability of common short-term prediction methods are shown in Table 1.
Despite the abundant results on short-term forecast, only a few scholars have explored the prediction or early warning of the passenger flow at urban bus stations. Gong et al. [36] proposed an ARIMA model and a Kalman filter to predict the number of passengers waiting at a bus station. Han et al. [35] created a hybrid and optimized LSTM to project the bus passenger flow. Van Oort et al. [37] converted the IC card data to the number of passengers per line, constructed an origin-destination (OD) matrix between stations, and assigned the matrix to the network to reproduce the measured passenger flow. Kumar et al. [38] developed a bus travel-time prediction method that considers both spatial and temporal variations in travel time. Wu et al. [39] built a convolutional LSTM (ConvLSTM) model with a self-attention mechanism, which accurately predicts the travel time on each segment of a trip and the waiting time at each station. Considering the small size, strong time-variation, and extraction difficulty of short-term passenger flow at bus stations, some scholars have taken account of connected and autonomous vehicles to alleviate the variability of travel time [40,41]. Despite these efforts, it is still difficult to make a realistic forecast of the short-term passenger flow at bus stations. Wang et al. [42] designed a new framework to solve the problem of early warning for sudden passenger flow. Pereira et al. [43] detected overcrowding with a threshold-based method and defined any point whose arrivals exceed the 90th percentile as an overcrowding point. Bai et al. [44] monitored passenger flow and displayed the distribution trend of real-time passenger flow with GIS technology. To sum up, the current research mainly focuses on the early warning of traffic flow, and there is little research on the monitoring and early warning methods of passenger flow at bus stops.
In this paper, we first analyze the features of passenger flow at bus stations and propose a novel concept called the regional bus station, which is aggregated from actual bus stations. Then, the SVM was introduced to predict the short-term passenger flow at regional bus stations. The results show that the prediction accuracy of short-term passenger flow could be improved by replacing actual bus stations with virtual regional bus stations. On this basis, we designed an early warning model for regional bus passenger flow, which monitors the passenger flow in important areas during the period of special activities (e.g., large events) and allows control measures to be taken in advance to ensure the smooth progress of these activities. The remainder of this paper is organized as follows: Section 2 puts forward the concept of regional bus stations by analyzing the IC card data of Shenzhen from November 7 to December 4, 2016, and summarizes the features of regional bus passenger flow; Section 3 selects and trains the prediction variables and employs the SVM to make short-term prediction of the passenger flow at regional bus stations; Section 4 analyzes the accuracy of prediction results; Section 5 carries out the capacity analysis and derives the early warning model for regional bus passenger flow; Section 6 puts forward the conclusions.

Mathematical Problems in Engineering

Bus Station Data and IC Card Data.
The IC card used in this paper is Shenzhen Tong, a stored-value card accepted on Shenzhen buses and the Shenzhen Metro, which is manufactured under the supervision of Shenzhen Transport Bureau and issued by Shenzhen public transport settlement management center. The bus data only display the valid information, such as the user card number, card swiping time, and bus license plate number, as shown in Table 2.
We extracted the GPS data and the card data of passengers and mapped the bus stops according to the time-space relationship. The GPS data of 28 days from November 7 to December 4, 2016, are used in this paper; they contain about 541,115,294 records of time and location information for 10,314 vehicles. Because this period covers both working and nonworking days, we can study the characteristics of bus passenger flow on both. The passenger flows of bus stations in Shenzhen were acquired from the IC card data during the bus operation time (6:00-22:00) from November 7 to December 4, 2016. Figure 2 shows the distribution of the number of card swipes on buses in Shenzhen on a typical day.
To pinpoint the bus line number of each card swipe, the Global Positioning System (GPS) data were matched with the location data of bus stations, as shown in Figure 3.
Through the matching, the bus arrival time was obtained for each bus station. Then, the boarding station of each card user was identified by acquiring the boarding time and bus line number from his/her IC card and matching the bus line number with the arrival time obtained in the previous step. The time-varying distribution of bus trips over many days shows obvious peak characteristics, with the peaks concentrated around 7:00 and 18:00. Accordingly, 7:00-9:00 and 17:00-19:00 are selected as the morning and evening peak hours of bus travel, as shown in Figure 4.
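The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a swipe is assigned to the stop its vehicle reached most recently, and it matches on the license plate number listed in Table 2. The function name, record layouts, and time unit (seconds) are hypothetical.

```python
from bisect import bisect_right


def match_boarding_stops(arrivals, swipes):
    """Assign each card swipe to the stop its bus reached most recently.

    arrivals: list of (plate, arrival_time_s, stop_id) from GPS/stop matching
    swipes:   list of (card_id, plate, swipe_time_s) from the IC card records
    Returns a list of (card_id, stop_id), with stop_id None when no match exists.
    """
    # Group arrivals by vehicle and sort them by time (assumed layout).
    by_plate = {}
    for plate, t, stop in arrivals:
        by_plate.setdefault(plate, []).append((t, stop))
    times, stops = {}, {}
    for plate, visits in by_plate.items():
        visits.sort()
        times[plate] = [t for t, _ in visits]
        stops[plate] = [s for _, s in visits]

    matched = []
    for card, plate, t in swipes:
        # Index of the last arrival at or before the swipe time.
        k = bisect_right(times.get(plate, []), t) - 1
        matched.append((card, stops[plate][k] if k >= 0 else None))
    return matched


arrivals = [("B1", 100, "S1"), ("B1", 400, "S2"), ("B2", 50, "S9")]
swipes = [("c1", "B1", 120), ("c2", "B1", 400), ("c3", "B1", 90), ("c4", "B3", 10)]
result = match_boarding_stops(arrivals, swipes)
```

Swipes that precede the first recorded arrival of a vehicle, or that belong to a vehicle without GPS records, are left unmatched rather than guessed.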

Regional Bus Station.
To facilitate the passenger flow prediction of all bus stations in a region of Shenzhen (E: 113.76°-114.62°; N: 22.45°-22.87°), the road network in the region was meshed into 3,612 grids of 1 km × 1 km. Of these, 993 grids were found to have bus stations. In this way, the 53,914 bus stations in the region were aggregated into 993 regional bus stations (the black spots in the lower part of Figure 5).
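The aggregation into regional bus stations can be illustrated with a short sketch. Dividing the stated bounding box into 0.01° steps gives 86 × 42 = 3,612 cells, which matches the grid count above, so the sketch assumes a 0.01° cell (roughly 1 km at this latitude). The cell size, function names, and station tuples are assumptions made for illustration only.

```python
import math
from collections import defaultdict

# Bounding box taken from the text; the 0.01 degree cell size is an assumption.
LON_MIN, LON_MAX = 113.76, 114.62
LAT_MIN, LAT_MAX = 22.45, 22.87
CELL_DEG = 0.01

N_COLS = round((LON_MAX - LON_MIN) / CELL_DEG)
N_ROWS = round((LAT_MAX - LAT_MIN) / CELL_DEG)


def cell_of(lon, lat):
    """Return the (col, row) grid index of a point, or None if outside the region."""
    col = math.floor((lon - LON_MIN) / CELL_DEG)
    row = math.floor((lat - LAT_MIN) / CELL_DEG)
    if 0 <= col < N_COLS and 0 <= row < N_ROWS:
        return col, row
    return None


def aggregate(stations):
    """Group (station_id, lon, lat) records into regional stations keyed by cell."""
    regional = defaultdict(list)
    for station_id, lon, lat in stations:
        cell = cell_of(lon, lat)
        if cell is not None:
            regional[cell].append(station_id)
    return dict(regional)


# Hypothetical station coordinates, chosen well inside their cells.
stations = [("A", 113.765, 22.455), ("B", 113.767, 22.458), ("C", 114.615, 22.865)]
regional_stations = aggregate(stations)
```

Each non-empty cell then plays the role of one regional bus station, and its passenger flow is the sum over the stations it contains.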

Short-Term Passenger Flow Prediction at Regional Bus Stations
Principles.
The short-term regional passenger flow was predicted in the following steps:
Step 1. Encode all the grids, and count the number of stations in each grid.
Step 2. Taking 1 h as the time window, count the number of passengers boarding buses at each station from 6:00 to 23:00 in the four days from December 1 to December 4, 2016.
Step 3. Allocate the data in four time windows (19:00-20:00, 18:00-19:00, 17:00-18:00, and 16:00-17:00) of the first 3 days (December 1-3, 2016) to the training set and the data in the same time windows of the last day (December 4, 2016) to the test set. Train the data by the ML-based forecast procedure.
Step 4. Select the features of short-term regional bus passenger flow through ML, producing a set of valid features $R$.
Step 5. Select the top $N$ features that influence the target period the most from the feature set $R$, and take them as the input of the regression model for SVM-based prediction.

Feature Training.
Let $X_{(l \times d)} = (x_1, x_2, \ldots, x_l)^{T}$ be the training data, where $d$ is the feature dimension, $x_l$ is the set of passengers boarding on the $l$-th day under the $d$ features, and $y_l$ is the test data of the $l$-th day. The alternative features of a target grid come from the 1 km × 1 km grids around that grid, so the alternative feature set of each of the 993 grids in the region depends on which grids surround it. If there are $i$ grids in the set, then the number of boarding passengers should be counted in the previous $j$ time windows of each grid in the set. Therefore, the feature dimension can be expressed as $d = i \times j$. The number of alternative features varies from grid to grid; that is, $d$ is a variable in $X_{(l \times d)}$. Here, the value of $j$ is set to 3. The details of alternative features are presented in Table 3.
Let $T$ be the target time window. Then, the $N = 3$ most important features were selected through recursive feature elimination (Algorithm 1).
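The feature selection step can be sketched with scikit-learn's `RFE` wrapped around a linear regression model, which matches the inputs named for Algorithm 1. The data below are synthetic and purely illustrative: the choice of `i = 4` neighboring grids, the sample size, and the coefficients are assumptions, not values from the study.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def select_top_features(X, y, n_keep=3):
    """Return the column indices kept by recursive feature elimination."""
    selector = RFE(LinearRegression(), n_features_to_select=n_keep, step=1)
    selector.fit(X, y)
    return np.flatnonzero(selector.support_)


# Synthetic example: i grids in the candidate set, j previous time windows each.
rng = np.random.default_rng(0)
i_cells, j_windows = 4, 3          # assumed values for illustration
d = i_cells * j_windows            # feature dimension d = i * j
X = rng.normal(size=(60, d))

# Only three candidate features actually drive the (synthetic) target.
y = 3.0 * X[:, 0] + 2.0 * X[:, 4] + 5.0 * X[:, 7] + 0.1 * rng.normal(size=60)

selected = select_top_features(X, y, n_keep=3)
```

Because the elimination ranks features by the magnitude of their regression weights, the candidate features should be on comparable scales before this step is applied to real counts.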

SVM-Based Regression Prediction.
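The regression prediction takes the $N$ features that influence the target period the most as the inputs. Thus, the dimension of each sample $x_i$ is equal to $N$.
Let $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, $x \in \mathbb{R}^{N}$, $y \in \mathbb{R}$, be the training set. Then, the training samples were mapped into a higher-dimensional space by function $\Phi$, turning the nonlinear regression inputs into high-dimensional inputs for linear regression. With the nonlinear mapping, the regression function can be expressed as

$$f(x) = \langle w, \Phi(x) \rangle + b.$$

Then, the SVM-based nonlinear regression can be defined, in the standard $\varepsilon$-insensitive dual form, as

$$\min_{\alpha^{(*)}} \ \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^{*} - \alpha_i)(\alpha_j^{*} - \alpha_j) K(x_i, x_j) + \varepsilon \sum_{i=1}^{l} (\alpha_i^{*} + \alpha_i) - \sum_{i=1}^{l} y_i (\alpha_i^{*} - \alpha_i),$$

$$\text{s.t.} \ \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*}) = 0, \quad 0 \le \alpha_i, \alpha_i^{*} \le C, \quad i = 1, \ldots, l,$$

where $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ is the kernel function, $C$ is the penalty parameter, and $\varepsilon$ is the error term. The optimum solution can be calculated as $\alpha^{(*)} = (\alpha_1, \alpha_1^{*}, \ldots, \alpha_l, \alpha_l^{*})^{T}$. A positive component $0 < \alpha_j < C$ of $\alpha$ or $0 < \alpha_j^{*} < C$ of $\alpha^{*}$ was chosen. Then, $b$ can be calculated by

$$b = y_j - \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i) K(x_i, x_j) + \varepsilon \quad \text{(if } \alpha_j \text{ was chosen)},$$

$$b = y_j - \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i) K(x_i, x_j) - \varepsilon \quad \text{(if } \alpha_j^{*} \text{ was chosen)}.$$

Then, the decision function can be described as

$$f(x) = \sum_{i=1}^{l} (\alpha_i^{*} - \alpha_i) K(x_i, x) + b.$$

For target grids, the number of passengers boarding in time window $t$ can be forecasted based on the same $N$ most important features on the given day. The radial basis function (RBF) was chosen as the kernel function:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0.$$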
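An RBF-kernel support vector regression of this kind can be sketched with scikit-learn's `SVR`. The example is a hedged illustration on synthetic data with $N = 3$ input features; the hyperparameter values (`C`, `epsilon`, `gamma`), the scaling step, and the target function are assumptions and are not taken from the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the N = 3 selected features (values are illustrative).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]  # smooth nonlinear target

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Scale the inputs, then fit an epsilon-insensitive SVR with an RBF kernel.
# C is the penalty parameter and epsilon the width of the insensitive tube.
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale"),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_mae = float(np.mean(np.abs(y_pred - y_test)))
```

In practice, `C`, `epsilon`, and `gamma` would be tuned on held-out time windows rather than fixed in advance.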

Verification of Time Window.
The target time window for prediction was set as 1 h. To verify the correctness of the time window, the real number of passengers boarding in 19:00-20:00 on December 4, 2016, was selected as the current period and compared with the real number of the previous period (18:00-19:00), the real number of the subsequent period (20:00-21:00), and the predicted value of the current period (19:00-20:00) (Figure 7).
As shown in Figures 8 and 9, the points were scattered, and the degree of linear fitting was poor, indicating that the number of passengers at each station in the grid fluctuates significantly within the time window of 1 h. Hence, it is necessary to set 1 h as the time window.
Moreover, the regression result of Figure 8 was $\beta_1 > 1$ ($\beta_1$ is the slope of the linear fitting line), that is, the result is greater than the actual boarding number in 19:00-20:00 of all grids. The regression result of Figure 9 is $\beta_1 < 1$, indicating that the actual boarding number in 19:00-20:00 of all grids is more than that in the following hour. Therefore, the data in the current time window cannot be replaced by the data of the previous or subsequent time window. In addition, in Figures 6 and 7, the slope $\beta$ of the linear fit approaches 1, indicating that it is better to use 1 h as the time window.
This further confirms the rationality of the 1 h time window.

Verification of Grid Size.
We partitioned the road network into 3,612 square grids with an area of 1 × 1 sq. km each. Of these 3,612 grids, 993 had at least one bus station inside. We also partitioned the road network into grids of 0.5 × 0.5 sq. km and found that 10% of the area had no bus stops. Moreover, the division of regional bus stations should take the average length of urban road sections into account, so that the area enclosed by the road sections is taken as the regional bus station as far as possible; the average length of a road section in this paper is 0.78 km. In addition, we forecast the passenger flow of regional bus stations with different flows in different regions. We find that the prediction accuracy increases with the passenger flow of regional bus stations, as shown in Figure 10. Therefore, it is appropriate to select 1 × 1 sq. km as the size of regional bus stations.
Table 3: The details of alternative features.

Model Test.
It can be directly inferred from Figure 6 that the actual value and the predicted value follow a linear relationship. Thus, simple linear regression was adopted for data fitting:

$$y = \beta_0 + \beta_1 x,$$

where $\beta_0$ and $\beta_1$ both obey the normal distribution. If $\beta_0$ approaches 0 and $\beta_1$ approaches 1, then the model has a high degree of regression. The dataset to be regressed was defined as $(x_i, y_i)$ $(i = 1, 2, \ldots, n;\ n = 993)$. Next, the linear regression problem was solved by the least squares (LS) method.
For the time window 11:00-12:00, the results were $\beta_0 = 4.20$, with a 95% confidence interval of (2.42, 5.94), and $\beta_1 = 0.90$, with a 95% confidence interval of (0.885, …). The MAPE, MAE, VAPE, and RMSE of the predicted results on this time window were 0.18, 35.16, 0.12, and 65.42, respectively.
For the time window 19:00-20:00, the results were $\beta_0 = 2.23$, with a 95% confidence interval of (−0.08, 4.55), $\beta_1 = 0.96$, with a 95% confidence interval of (0.95, 0.97), and a coefficient of determination $R^2 = 0.98$. The MAPE, MAE, and RMSE of these results were 0.16, 26.29, and 54.29, respectively.
The $R^2$ values of the two time windows demonstrate the significant correlation between variables $X$ and $Y$. In both time windows, the $\beta_0$ value approached 0, and the $\beta_1$ value approached 1. The error metrics of the results on the two time windows were both satisfactory. Therefore, the proposed ML method is favorable for predicting bus passenger flow in the short term.
Algorithm 1: Input: training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, $x \in \mathbb{R}^{d}$, $y \in \mathbb{R}$, and a linear regression model.
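The error metrics and the least squares fit used in this test can be written out in a few lines. The sketch below uses the usual textbook definitions of MAPE, MAE, and RMSE; skipping grids with zero actual boardings in the MAPE is an assumption, and the sample numbers are made up for illustration.

```python
import math


def error_metrics(actual, predicted):
    """Return MAPE, MAE, and RMSE for paired actual and predicted values."""
    pairs = list(zip(actual, predicted))
    n = len(pairs)
    mae = sum(abs(a - p) for a, p in pairs) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in pairs) / n)
    # Assumption: observations with zero actual flow are excluded from MAPE.
    nonzero = [(a, p) for a, p in pairs if a != 0]
    mape = sum(abs(a - p) / abs(a) for a, p in nonzero) / len(nonzero)
    return {"MAPE": mape, "MAE": mae, "RMSE": rmse}


def fit_line(x, y):
    """Least squares estimates (beta_0, beta_1) for y = beta_0 + beta_1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_1 = sxy / sxx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1


metrics = error_metrics([100, 200], [110, 180])
beta_0, beta_1 = fit_line([1, 2, 3], [2, 4, 6])
```

A fitted slope close to 1 with an intercept close to 0 corresponds to the criterion stated above for a high degree of regression.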

Comparative Analysis of Prediction Accuracy between Single Bus Stop and Regional Bus Stop.
To compare the prediction accuracy of a single bus stop and a regional bus stop, we selected the Window of the World regional station, which contains three bus stops: Window of the World stop, Baishizhou stop, and Meilu Jinyuan stop. The bus passenger flow of November 28 was predicted, and the prediction results are shown in Figures 11 and 12. The accuracy of the regional bus station prediction is significantly higher than that of the single stops. For stops with relatively little passenger flow in particular, the prediction accuracy is so low that the results have no practical application value.

Model Construction.
The passenger capacity $C$ of a regional bus station is calculated from the capacities $C_i$ of its stations and the residual capacity $P_j$ of the regional bus lines. According to the Traffic Engineering Manual, the capacity $C_i$ of station $i$ is computed from the following quantities: $(g/c)_i$ is the green light ratio of the intersection in front of station $i$; $R$ is the adjustment coefficient reflecting the degree of impact from bus arrival time and boarding time on station capacity (the empirical value is 0.833); $D$ is the average boarding/alighting time (the value is generally 20-50 s); $t_c$ is the average clearance time of the bus station, i.e., the time required for the former bus to depart from and the current bus to arrive at the same position at the station (the value is generally 9-20 s); $m$ is the total number of regional bus stations; and $N_i$ is the number of effective berths at the $i$-th station. The residual capacity $P_j$ of regional bus line $j$ is calculated from the following quantities: $S_j$ is the load factor of line $j$ at an acceptable service level (%), i.e., 80% of the rated passenger capacity; $f_j$ is the departure interval of line $j$ (min); $H_j$ is the capacity of a single bus on line $j$ (persons) (the value is generally 60 persons); and $Q_j$ is the actual capacity of line $j$ (%). The early warning coefficient $K$ is calculated from $f(x)$, the predicted short-term passenger flow of a regional bus stop, and $\alpha$, the proportion of IC card swiping passengers in the total number of bus passengers in the region.
The early warning method in this paper combines the early warning index with warning limit intervals: when the value of the early warning index falls within a given warning limit interval, an alarm of the corresponding degree is issued.
Following the setting method of traffic flow warning intervals [45] and considering the level of public transport service, that is, the passenger comfort, reliability, and safety of the operation service [46], the early warning interval of regional bus stops is divided into four levels. The first-level warning interval corresponds to the first level of service: the overall full-load rate of regional public transport vehicles is extremely low, and both the passengers' comfort and the safety factor are very high. The second-level warning interval corresponds to the second level of service: the overall full-load rate of regional public transport vehicles is not high, the waiting time of passengers is short, the congestion is low, and the safety factor is relatively high. The third-level warning interval corresponds to the third level of service: the public transport vehicles are at the edge of full load, the safety factor is low, and the buses are crowded, which represents the general safety level. The fourth-level warning interval corresponds to the fourth level of service: the safety factor is very low, the departure frequency cannot meet the needs of passengers, and the bus congestion is high, which represents danger. The warning coefficients of the four levels are shown in Table 4.

Value of Early
If K falls in (0, 0.75), the region is safe; if K falls in (0.75, 0.9), the region needs to be monitored; if K surpasses 0.9, the region must be alarmed.
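The thresholds above translate directly into a small classification routine. In the sketch below, the form of the coefficient, $K = (f(x)/\alpha)/C$, is an assumption inferred from the symbol definitions, since it scales the predicted IC card flow up to all passengers and compares it with capacity. The treatment of the boundary values 0.75 and 0.9 as "monitor" is also an assumption, and the input numbers are hypothetical.

```python
def warning_coefficient(predicted_ic_flow, ic_share, capacity):
    """Assumed form of K: estimated total passengers divided by regional capacity.

    predicted_ic_flow: predicted IC card boardings f(x) in the time window
    ic_share:          proportion alpha of passengers who pay by IC card
    capacity:          regional bus capacity C in the same time window
    """
    if not 0 < ic_share <= 1:
        raise ValueError("ic_share must be in (0, 1]")
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (predicted_ic_flow / ic_share) / capacity


def warning_level(k):
    """Map K to the three states given in the text (boundary handling assumed)."""
    if k < 0.75:
        return "safe"
    if k <= 0.9:
        return "monitor"
    return "alarm"


# Hypothetical inputs, chosen only to exercise the thresholds.
k_value = warning_coefficient(predicted_ic_flow=1800, ic_share=0.5, capacity=4000)
level = warning_level(k_value)
```

Keeping the coefficient and the thresholds in separate functions makes it straightforward to swap in a different capacity model or a four-level scheme without touching the classification logic.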

Case Analysis.
The case analysis targets the grid at the junction of Shennan Avenue and Qiaocheng Road (Figure 13). Shennan Avenue is one of the busiest roads in Shenzhen. Involving 16 bus stations and many bus lines, the target grid is highly representative of the public transport in Shenzhen. As shown in Figure 13, there are 35 bus lines at Kongjia Group Station on Shennan Avenue alone. The proposed SVM-based algorithm was adopted to train and learn the historical boarding number of stations in the target grid and predict the daily number of boarding passengers of 20 days (6:00-23:00). Then, the predicted results were compared with the real values in the grid (Figure 14).
Figure 11: Prediction result between single bus stop and regional bus stop.
It can be seen from Figure 15 that our algorithm achieved a good prediction effect, as the predicted results followed the same trend as the real values. The MAE and MAPE were small, and the RMSE was smaller than 10%. Through capacity analysis, the capacity of regional bus stations in 6:00-24:00 was 4,113 person times/h, while the residual capacity of bus lines was 3,840 in 6:00-7:00, 5,760 in 7:00-9:00, 4,032 in 9:00-17:00, 5,760 in 17:00-19:00, and 4,860 in 19:00-24:00. The regional bus capacity and K value in each period are presented in Figure 15.
According to the K values in Figure 15, the early warning coefficient in the target grid was 0.9 in 18:00-19:00. Hence, in this period, the bus capacity in the region is saturated and needs to be monitored. To reduce the K value, it is necessary to increase the regional bus capacity by stepping up departure frequency, improving the capacity of some stations, and expanding the effective parking spaces at stations.

Conclusions
This paper analyzes the IC card swiping data of all buses in Shenzhen from November 7 to December 4, 2016 (6:00-22:00), and introduces the ML method of SVM to predict the short-term passenger flow of urban bus stations. The main conclusions are as follows:
(1) The training samples of the candidate feature sets and the weight of each feature were obtained through linear regression. The N features with the greatest impact on the target period were selected as the input of the regression model. The SVM-based regression prediction was adopted to predict the bus passenger flow in the target time window. The model achieved high prediction accuracy when the time window is 1 h. The MAPE, MAE, and RMSE of these results were 0.16, 26.29, and 54.29, respectively.
(2) The capacity and early warning coefficient K of regional bus passenger flow were analyzed in detail. According to the K values, the capacity of some bus stations and bus lines in the target region should be improved to further optimize the service of public transport.
(3) The concept of the regional bus stop is put forward in this paper; a suitable short-term prediction method for the passenger flow of regional bus stops is constructed, and the classification method and early warning coefficient of regional bus stop service level are developed.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.