Prediction of Train Arrival Delay Using Hybrid ELM-PSO Approach

In this study, a hybrid method combining an extreme learning machine (ELM) and particle swarm optimization (PSO) is proposed to forecast train arrival delays; the forecasts can then be used for delay management and timetable optimization. First, nine characteristics (e.g., buffer time, the train number, and station code) associated with train arrival delays are chosen and analyzed using an extra trees classifier. Next, an ELM with one hidden layer is developed to predict train arrival delays, taking these characteristics as input features. Furthermore, the PSO algorithm is chosen to optimize the hyperparameter of the ELM, in comparison with Bayesian optimization and a genetic algorithm, which removes the burden of manual tuning. Finally, a case study is conducted to confirm the advantage of the proposed model. Compared with four baseline models (k-nearest neighbor, categorical boosting, Lasso, and gradient boosting decision tree) across different metrics, the proposed model is shown to be efficient and to achieve the highest prediction accuracy. In addition, a detailed analysis of the prediction error shows that our model possesses good robustness and correctness.


Introduction
With the rapid development of society and the continuous improvement of people's quality of life, people have put forward higher requirements for the reliability and punctuality of high-speed railway transportation [1]. However, a running train is inevitably disturbed by a large number of random factors, which lead to train delays. For one thing, train delays change the structure of the train diagram, increase the cost of railway operation and the difficulty of using transportation resources reasonably, and have a great negative impact on the reliability and punctuality of high-speed railway operation. For another, they increase the travel time of passengers, affect their travel plans, and bring serious inconvenience to passengers [2]. Therefore, accurate forecasting of train delays is of great significance for high-speed train operation organization, transportation service quality improvement, and operation safety [3]. Traditional models are the classical approach to train delay prediction and include probability distribution models [4,5], regression models, event-driven methods, and graph theory-based approaches. For the probability distribution model, Higgins and Kozan proposed an exponential distribution model, which applied a three-way, two-block station train delay propagation signal system, to estimate the delays of trains caused by train operational accidents [4]. By assessing the linear relationship between several independent features and dependent features [2], regression models have been widely employed to predict train delays, dwell times, and running times [6,7]. However, the main drawback of regression models is that the ability of a linear analytic model relies heavily on its internal mathematical assumptions. They are good at capturing linear relationships between features and at dealing with low-dimensional data, but not at handling data that are not simply linear, such as train operation data [8].
For event-driven and graph theory-based methods, Kecman and Goverde [9] used a timed event graph with dynamic weights to predict train running times. Milinković et al. [10] used a fuzzy Petri net to predict train delays; the model considers the characteristics of the hierarchical structure and fuzzy reasoning to simulate train operation and predicts train delays in different delay scenarios. Huang et al. [8] used graph theory to calculate the degree and propagation range of train delays under specific conditions. Although traditional models were the first to be applied to delay prediction, they generally generalize poorly and are suitable only for specific scenarios.
Recently, the application of machine learning methods to train delay prediction has attracted wide attention from researchers, as these methods make up for the shortcomings of traditional methods [2]. Peters et al. aimed to use the historical travel time between stations to predict the train arrival time more precisely [11]. A moving average algorithm of historical travel times and a KNN algorithm of last arrival times were employed and evaluated. Some researchers have turned to ANNs to predict train delays [12][13][14]. The aim of [14] was to propose ANNs to predict the train delays of the Iranian railway with three different models, including standard real number, binary coding, and binary set encoding inputs. Nevertheless, the prediction accuracy of ANNs cannot meet the needs of actual delay management, and the parameter adjustment is complex. Marković et al. [2] proposed a support vector regression model for the passenger train delay problem, which captured the relationship between the arrival delay and a variety of changing external factors, and compared it with artificial neural networks. The results indicated that the support vector regression method outperformed the ANNs. Another type of network model has also been proposed in recent years. A Bayesian network (BN) model for predicting the propagation of train delays was presented by [15]. In view of the complexity and dependence, three different BN schemes for train delay prediction were proposed, namely, heuristic hill-climbing, primitive linear, and hybrid structures [16]. The results turned out to be quite satisfactory. Recently, it has become popular to combine several models to capture various characteristics of train operation data when predicting train delays. One study developed a train delay prediction model that combines convolutional neural network, long short-term memory network, and fully connected neural network architectures to solve this issue [17].
To improve on the backpropagation algorithm and simplify the setting of learning parameters of general machine learning models, the ELM algorithm was proposed by Huang et al. [18]. ELM has the advantages of low computational cost, good generalization, and fast convergence. On account of these advantages, ELM has been frequently applied to real-world regression problems [19][20][21][22]. For example, a study that combined a shallow ELM and a deep ELM tuned via the threshold out technique was employed to predict train delays, taking weather data into account [23].
Parameter adjustment is another critical factor in guaranteeing the good performance of machine learning models [24,25]. Although the well-known random search algorithm can achieve the purpose of optimization, it generates all solutions randomly without considering previous solutions. A parameter adjustment method called PSO has become widely used because of its ability to address intractable real-world problems. Only the optimal particle of PSO transmits information to the next particle in the iterative evolution process. As a consequence, the searching speed of PSO is faster than that of random search and grid search [26]. The experiment in [27] confirmed this advantage of PSO: by comparing the performance of PSO with the random search algorithm on an optimal control problem, [27] found that PSO was capable of locating a better solution than random search with the same number of fitness function evaluations. Therefore, based on these findings, we propose PSO to optimize the hyperparameter of the ELM to forecast train arrival delays. The contributions of this paper are as follows:
(1) The main features affecting train delay prediction are evaluated by the extra trees classifier. Then, the proposed model is constructed based on these features, which possess spatiotemporal characteristics (train delays at each station). In this way, the interpretability of the proposed model is improved.
(2) The proposed model is applied to the arrival delay prediction of trains on an HSR line, which offers a new perspective on the train delay prediction problem. In addition to overcoming the drawbacks of the backpropagation algorithm, ELM-PSO removes the arduous manual tuning of the hidden neurons of the ELM and does so with better accuracy and efficiency than random search and Bayesian optimization.
(3) We perform experiments on a section of the Wuhan-Guangzhou (W-G) HSR line. The proposed model is compared not only with two other parameter adjustment models but also with four prediction models from different perspectives. Our model proves able to handle large-scale data with high accuracy.
The remainder of this paper is organized as follows: in Section 2, the train delay problem and the selection of characteristic features are described. The hybrid ELM-PSO approach is introduced in detail in Section 3. The data description and experimental settings are presented in Section 4. The performance analysis is discussed in Section 5. Finally, conclusions are presented in Section 6.

Description of the Train Delay Problem
The train delay problem is visualized in Figure 1 to make this abstract problem easier to comprehend. A train delay comprises two parts, the train arrival delay and the train departure delay. For a station $s_n$, $t_{As_n}$ denotes the time at which a train is scheduled to arrive at station $s_n$, and $t_{Ds_n}$ denotes the time at which it is scheduled to depart from station $s_n$. Because of changing external factors, the train follows its own actual timetable, whose arrival and departure times are denoted by $t'_{As_n}$ and $t'_{Ds_n}$, respectively. The difference between the actual and scheduled arrival times at station $s_n$, $t'_{As_n} - t_{As_n}$, is referred to as the train arrival delay. Likewise, the train departure delay is $t'_{Ds_n} - t_{Ds_n}$. This is the basic description of the train delay problem. This paper focuses only on train arrival delay prediction. We suppose that there is a target train $T_k$, which is at the present station $s_n$ at time $t_{As_n}$. Our purpose is to predict the arrival delay ($t'_{As_{n+1}} - t_{As_{n+1}}$) of the target train $T_k$ at its following station $s_{n+1}$ for all conditions according to the information of train $T_k$ at stations $s_n$, $s_{n+1}$, and $s_{n-1}$, which is made up of the following nine features:
(1) The station code ($X_1$)
(2) The train number ($X_2$), which indicates the number of the train
(3) The length between the present station and the next station ($D_{s_{n+1}} - D_{s_n}$) ($X_3$)
(4) The scheduled running time between the present station and the previous station ($t_{As_n} - t_{As_{n-1}}$) ($X_4$)
(5) The actual running time between the present station and the previous station ($t'_{As_n} - t'_{As_{n-1}}$) ($X_5$)
(6) The scheduled running time between the present station and the next station ($t_{As_{n+1}} - t_{As_n}$) ($X_6$)
(7) The actual running time between the present station and the next station ($t'_{As_{n+1}} - t'_{As_n}$) ($X_7$)
(8) Buffer time, which indicates the difference between $X_6$ and the actual minimum running time of all trains between the present station and the next station
$Y$ represents the arrival delay time at the next station $s_{n+1}$ of train $T_k$ ($t'_{As_{n+1}} - t_{As_{n+1}}$). There are multiple potentially interdependent features (e.g., the train number, the length between two adjacent stations) that are closely related to train delay prediction.
Based on the collected data and the experience of dispatchers, we ultimately select nine features that may influence train delays.
We apply an extra trees classifier to analyze the correlation between all features and train delays.
The results are shown in Figure 2. As shown in the figure, the deeper the red, the higher the importance. $X_9$ clearly has the highest importance for $Y$. The actual and scheduled running times between the present station and the next station also contribute substantially to the accurate prediction of $Y$. Moreover, the buffer time, which is an important factor affecting the length of the train recovery time, is also comparatively prominent in the delay prediction process. Taking the buffer time into account allows us to obtain more realistic prediction results. The train arrival delay prediction problem in this paper is transformed into the following expression:
$$Y = f(X_1, X_2, \ldots, X_9), \tag{1}$$
where $X_i$ is the information of train $T_k$ running through stations $s_n$, $s_{n+1}$, and $s_{n-1}$, $Y$ is the arrival delay time at the following station $s_{n+1}$ of train $T_k$, and $f$ is the prediction process.
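The timetable-based quantities defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `StationTimes` class, the function names, and the sample times (in seconds since midnight) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StationTimes:
    """Scheduled and actual arrival times at one station (seconds since midnight)."""
    scheduled_arrival: int
    actual_arrival: int


def arrival_delay(st: StationTimes) -> int:
    """Arrival delay: actual minus scheduled arrival time."""
    return st.actual_arrival - st.scheduled_arrival


def running_time_features(prev: StationTimes, cur: StationTimes, nxt: StationTimes) -> dict:
    """Features X4, X5, and X6 as defined in the feature list."""
    return {
        "X4": cur.scheduled_arrival - prev.scheduled_arrival,  # scheduled, previous -> present
        "X5": cur.actual_arrival - prev.actual_arrival,        # actual, previous -> present
        "X6": nxt.scheduled_arrival - cur.scheduled_arrival,   # scheduled, present -> next
    }


def buffer_time(x6: int, observed_running_times: list) -> int:
    """Buffer time: X6 minus the minimum observed running time on the same segment."""
    return x6 - min(observed_running_times)


# Hypothetical times for one train at stations s_{n-1}, s_n, and s_{n+1}.
prev = StationTimes(scheduled_arrival=36000, actual_arrival=36060)
cur = StationTimes(scheduled_arrival=37200, actual_arrival=37380)
nxt = StationTimes(scheduled_arrival=38700, actual_arrival=38820)

features = running_time_features(prev, cur, nxt)
target_y = arrival_delay(nxt)  # Y: arrival delay at the next station
```

In this toy case the train reaches the next station 120 s late, while the present-to-next segment carries 120 s of buffer if the fastest observed run took 1380 s.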

Methodology
Figure 1: Conversion from the train itinerary to mathematical notation (nominal timetable, actual timetable, breakdown, and arrival delay time).
This paper proposes a hybrid model of ELM and PSO for train delay prediction. ELM is widely used in regression problems because of its low computational cost, good generalization performance, and fast convergence speed [19][20][21]. The PSO algorithm is a random and parallel optimization algorithm, which has the advantages of fast convergence and algorithmic simplicity [25,28]. Therefore, we aim to combine the advantages of ELM and the PSO algorithm to improve the behavioral knowledge in the delay prediction domain. For the principles of ELM and PSO, one can refer to Li et al. [29], Perceptron et al. [30], and Zhang et al. [31]. The running process of the proposed hybrid method is as follows:
Step 1: data preprocessing. First, the 9 features mentioned in Section 2 are assembled into an N × 9 matrix, where N represents the total number of events according to the train operation records. Second, abnormal delays are removed (trains may be canceled due to some emergencies) to reduce their interference with predictions. Third, missing data are filled in according to the adjacent data around the missing values.
Step 2: initializing the parameters and population. Parameters such as the maximal iteration number, the population size, and the speed and position of the first particle are initialized. Each particle $l$ has its own position $A_l(\mathrm{ite})$ and speed $V_l(\mathrm{ite})$. The position of each particle in the population is equivalent to the number of neurons in the hidden layer of ELM. Therefore, each particle has merely one dimension:
$$A_l(\mathrm{ite}) = h_l(\mathrm{ite}), \tag{2}$$
where $h_l(\mathrm{ite})$ represents the number of hidden layer neurons of ELM in the $\mathrm{ite}$-th iteration.
Step 3: ELM (hidden layer activation function: sigmoid function) is used. The processed feature set X and the position of particles (the number of hidden layer neurons) generated by PSO are input into ELM. Consequently, ELM can output the weight matrix under the current number of hidden layer neurons. The function for calculating the fitness of particles is as follows:
$$\mathrm{fitness} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(y_n - f_n(x)\right)^2}, \tag{3}$$
where $N$ is the number of samples, $y_n$ is the actual output value on the test set, and $f_n(x)$ is the predicted output value on the test set.
Step 4: calculate the fitness of each particle, and compare to update the current best fitness and its particle location.
Step 5: start the iteration. PSO will update the positions and velocities of all particles, and then repeat step 4. If the maximum number of iterations is exceeded, it will end the process.
Step 6: output the results. We can obtain the output value on test set as well as the optimal number of hidden layer neurons.
The specific flowchart is shown in Figure 3.
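Steps 2 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `elm_rmse` and `pso_tune_hidden`, the inertia and acceleration coefficients (`w`, `c1`, `c2`), the swarm size, the search range of 1 to 50 neurons, and the synthetic data are all hypothetical. The PSO update is the standard inertia-weight form, and the ELM output weights are computed with a pseudo-inverse. As in the text, the fitness is the RMSE on the held-out set.

```python
import numpy as np


def elm_rmse(n_hidden, X_tr, y_tr, X_te, y_te, seed=0):
    """Train a one-hidden-layer sigmoid ELM and return its RMSE on (X_te, y_te)."""
    rng = np.random.default_rng(seed)  # fixed seed: same n_hidden gives same fitness
    W = rng.uniform(-1.0, 1.0, size=(X_tr.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                   # random hidden biases

    def hidden(X):
        return 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid hidden layer output

    beta = np.linalg.pinv(hidden(X_tr)) @ y_tr  # output weights via pseudo-inverse
    residual = y_te - hidden(X_te) @ beta
    return float(np.sqrt(np.mean(residual ** 2)))


def pso_tune_hidden(fitness, h_min, h_max, n_particles=5, n_iter=5,
                    w=0.7, c1=1.5, c2=1.5, seed=0):
    """One-dimensional PSO over the hidden neuron count; returns (best_h, best_fitness)."""
    rng = np.random.default_rng(seed)
    to_h = lambda p: int(round(float(p)))  # continuous position -> neuron count

    # Step 2: initialize positions and velocities.
    pos = rng.uniform(h_min, h_max, size=n_particles)
    vel = rng.uniform(-1.0, 1.0, size=n_particles)
    # Steps 3-4: evaluate fitness and record personal and global bests.
    pbest_pos = pos.copy()
    pbest_fit = np.array([fitness(to_h(p)) for p in pos])
    g = int(np.argmin(pbest_fit))
    gbest_pos, gbest_fit = pbest_pos[g], pbest_fit[g]

    # Step 5: iterate until the maximum number of iterations is reached.
    for _ in range(n_iter):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, h_min, h_max)
        fit = np.array([fitness(to_h(p)) for p in pos])
        improved = fit < pbest_fit
        pbest_pos[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = int(np.argmin(pbest_fit))
        gbest_pos, gbest_fit = pbest_pos[g], pbest_fit[g]

    # Step 6: output the best neuron count and its fitness.
    return to_h(gbest_pos), float(gbest_fit)


# Synthetic stand-in for the N x 9 feature matrix (not the paper's data).
data_rng = np.random.default_rng(42)
X = data_rng.uniform(0.0, 1.0, size=(200, 9))
y = 3.0 * X[:, 8] + X[:, 6] - X[:, 5] + 0.05 * data_rng.normal(size=200)
X_tr, X_te, y_tr, y_te = X[:160], X[160:], y[:160], y[160:]  # 80/20 split

fitness = lambda h: elm_rmse(h, X_tr, y_tr, X_te, y_te)
best_h, best_rmse = pso_tune_hidden(fitness, h_min=1, h_max=50)
print(best_h, best_rmse)
```

Fixing the ELM seed makes the fitness a deterministic function of the neuron count, which keeps the swarm's comparisons consistent across iterations; whether the original experiments did this is not stated in the text.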

Dataset Description.
The data employed to verify the ELM-PSO are obtained from the dispatching office of a railway bureau. The 15 stations considered in the study lie on a 1096 km section, from CBN to GZS, of the double-track W-G HSR line.
More than 400,000 data points are used in this study, with a time span from October 2018 to April 2019. The original train operation data and the route map of the 15 target stations on the W-G HSR line are shown in Table 1 and Figure 4.
Analysis of the delay ratio of each station reveals not only the condition of each station but also the need for train arrival delay prediction, which helps each station cope with, and even curb, increases in train arrival delays. Trains with an arrival delay greater than 4 minutes are considered delayed trains. Figure 5 shows that the delay ratios of all the stations are generally not optimistic. In particular, the delay ratios of the two target stations, CZW and GZN, are especially high, with arrival delay ratios of 0.12. Our goal is to help reduce the arrival delay ratio by predicting the arrival delay at each station.
Journal of Advanced Transportation
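The per-station delay ratio described above can be computed as in the following sketch. The record layout, the function name, and the sample values are assumptions for illustration; only the 4-minute threshold and the station codes come from the text.

```python
from collections import defaultdict

DELAY_THRESHOLD_S = 4 * 60  # trains more than 4 minutes late count as delayed


def delay_ratio_by_station(records, threshold_s=DELAY_THRESHOLD_S):
    """records: iterable of (station_code, arrival_delay_seconds) pairs."""
    total = defaultdict(int)
    delayed = defaultdict(int)
    for station, delay_s in records:
        total[station] += 1
        if delay_s > threshold_s:
            delayed[station] += 1
    return {station: delayed[station] / total[station] for station in total}


# Hypothetical arrival records, not taken from the dataset.
records = [("CZW", 300), ("CZW", 0), ("CZW", 60), ("CZW", 500),
           ("GZN", 240), ("GZN", 10)]
ratios = delay_ratio_by_station(records)
```

Note that a delay of exactly 240 s does not count as delayed here, because the text specifies "greater than 4 minutes".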

Baseline Models.
To evaluate the performance of our proposed method, the k-nearest neighbor (KNN), categorical boosting (CB), gradient boosting decision tree (GBDT), and Lasso are used as baseline models. We take 20% of the dataset as the test set and the rest as the training set. The experiment runs in Python in an environment with an Intel® Core i5-6200U processor (2.13 GHz) and 8 GB RAM.
Briefly, an overview description and hyperparameter settings of each model are as follows:
(1) KNN: the KNN algorithm is extensively applied in a wide range of applications, owing to its simplicity, comprehensibility, and relatively promising performance [32].
(2) CB: CB is a machine learning model based on the gradient boosting decision tree (GBDT) [33,34]. CB is an outstanding technique, especially for datasets with heterogeneous features, noisy data, and complex dependencies.
(3) GBDT: GBDT has been applied to numerous problems [35]. It involves many nonlinear transformations, has strong expressive ability, and does not require complex feature engineering or feature transformation.
(4) Lasso: Lasso is a prevailing technique capable of simultaneously performing regularization and feature filtering. Furthermore, data can be analyzed from multiple dimensions by Lasso [36].

Evaluation Metrics.
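Root mean squared error (RMSE), mean absolute error (MAE), and R-squared are selected to assess the models. The definitions of the error metrics are shown in equation (4), equation (5), and equation (6):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}, \tag{4}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|, \tag{5}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}, \tag{6}$$
where $y_i$ is an observed value, $\hat{y}_i$ is a predicted value, $\bar{y}$ is the average value of $y_i$, and $N$ represents the sample size.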
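For concreteness, the three metrics can be written directly from their standard definitions. The function names and the toy values below are illustrative assumptions, not the paper's code or data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy observed and predicted delays (hypothetical values).
y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [0.0, 2.0, 4.0, 10.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

A single large error of 4 yields an RMSE of 2.0 but an MAE of only 1.0, which illustrates why the RMSE penalizes large deviations more heavily than the MAE.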

PSO Optimization Result Comparison.
The process of PSO tuning the hyperparameter is shown in Figure 6. The fitness value reaches its minimum after five iterations. The best fitness value is 1.0387 on the test set when the ELM has 1462 neurons, which gives the optimal network structure. The search range of the hyperparameter is determined by manually trying several values. When the hyperparameter value is greater than 2000, the fitness tends to be stable, while the time consumption increases sharply. Ultimately, we limit the search range, weighing time consumption against precision. The computational cost is shown in Table 2, and the results are the best results of each model over several runs. We gain two observations from this table. First, the optimal number of hidden neurons found by ELM-PSO is consistently 1462; the only difference between runs is the number of iterations at which the best RMSE is reached. Second, compared with ELM-BO and ELM-GA, ELM-PSO takes the shortest time to locate the optimal fitness on the test set.

Model Accuracy Comparison.
In this section, the performance comparison between ELM-PSO and baseline models is performed.
First, we compare the overall performance of the five models. The evaluation metrics are R-squared, the MAE, and the RMSE. The corresponding results on the test set and training set are summarized in Tables 3 and 4, respectively. The ELM-PSO model performs best among the five models on both the training set (R-squared = 0.9973; MAE = 0.3377; RMSE = 0.8247) and the test set (R-squared = 0.9955; MAE = 0.3490; RMSE = 1.0387). Although the running time of our model has no obvious advantage over the other models on the test set, it is within the tolerable range. We also notice that some models perform well on the training set but poorly on the test set, which shows how important sufficient generalization ability is in this prediction problem.
Then, by separating the delay duration into three bins (i.e., [0-1200 s], >1200 s, and all delayed trains (trains with arrival delay greater than 240 seconds)), we measure how well the benchmark models and our model capture the features of train delays of varying degrees on the test set. As shown in Table 5, the proposed model outperforms the other benchmark models in each time horizon and on each evaluation metric, achieving, for example, an RMSE of 0.5201 in the first bin. This finding is taken as evidence that our model can adapt to the characteristics of varying degrees of train delays to enhance the prediction accuracy. To further assess the performance of our model, more comprehensive analyses are discussed in a later section.
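The binned evaluation can be sketched as below. The bin boundaries follow the text; the dictionary layout, the function name, the choice to bin by the observed delay, and the sample delays (in seconds) are assumptions for illustration.

```python
import math

# Bins from the text; the third bin overlaps the others by design.
BINS = {
    "0-1200 s": lambda d: 0 <= d <= 1200,
    ">1200 s": lambda d: d > 1200,
    "delayed (>240 s)": lambda d: d > 240,
}


def rmse_by_bin(y_true, y_pred):
    """RMSE computed separately over the samples whose observed delay falls in each bin."""
    result = {}
    for name, in_bin in BINS.items():
        sq_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred) if in_bin(t)]
        result[name] = math.sqrt(sum(sq_errors) / len(sq_errors)) if sq_errors else None
    return result


# Hypothetical observed and predicted arrival delays in seconds.
y_true = [60, 300, 1500]
y_pred = [90, 260, 1440]
binned = rmse_by_bin(y_true, y_pred)
```

Reporting errors per bin in this way shows whether a model's accuracy holds for long delays, which are rare and would otherwise be masked by the many short ones.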

Further Analysis.
Building on the previous section, we evaluate the performance of the ELM-PSO model from other angles, including the prediction errors at each station, the prediction correctness, and the robustness.
First, the errors of the ELM-PSO model for the predicted arrival delays are calculated at the station level on the test set. Looking at the overall situation in Figure 7, we notice that the prediction errors are low. The MAE and R-squared both remain stable across stations, and the RMSEs for the different stations are mostly less than 90 s. However, large fluctuations occur at the YYE and QY stations. The reasons are that these two stations are close to the transfer stations and that the buffer times of YYE and QY are both small. The prediction accuracy at these stations tends to be slightly hindered by these factors.
In addition, to present more detailed results, we describe the correctness in terms of the absolute residual between the predicted values and the actual values for each station over three intervals (i.e., <30 s, 30 s-60 s, and 60 s-90 s) (Figure 8). In the <30 s interval, the correctness at each station exceeds 75%. In brief, the overall results confirm the impressive prediction correctness of the proposed model.
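The interval-based correctness can be computed as the share of predictions whose absolute residual falls in each interval. The following sketch assumes half-open intervals and uses hypothetical values in seconds; the function name is illustrative.

```python
def residual_shares(y_true, y_pred):
    """Share of samples whose absolute residual falls in each interval (seconds)."""
    intervals = [("<30 s", 0, 30), ("30-60 s", 30, 60), ("60-90 s", 60, 90)]
    n = len(y_true)
    shares = {}
    for label, low, high in intervals:
        count = sum(1 for t, p in zip(y_true, y_pred) if low <= abs(t - p) < high)
        shares[label] = count / n
    return shares


# Hypothetical observed and predicted arrival delays at one station.
y_true = [100, 200, 300, 400]
y_pred = [110, 240, 225, 400]
shares = residual_shares(y_true, y_pred)
```

Here the absolute residuals are 10, 40, 75, and 0 s, so half of the predictions fall in the <30 s interval.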
Finally, we investigate the robustness of our model to data size. In detail, we further train and test our model using 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90%, respectively, of the total data as the test set, and compare the results with the baseline models. The results for the different data sizes are shown in Figure 9. Our model outperforms the others on both the training and test sets. As we can see, the RMSEs of our model stay stable across data sizes, while the RMSEs of the other models are higher and fluctuate. These figures show that the proposed model has the smallest RMSE and MAE and the largest R-squared for all trains, which demonstrates the robustness of our model to different data sizes.

Statistical Tests.
In this section, the Friedman test (FT) and Wilcoxon signed rank test (WSRT) are used to verify the advantages of our proposed method compared with other methods [35,38].
The results of the FT and WSRT are shown in Table 6. The FT is a nonparametric statistical tool, which determines the difference by ranking the performance of each method. It can be seen from the table that the proposed method ranks better than CB, GBDT, Lasso, and KNN at the 5% significance level; that is, it performs better. In addition, the results of the WSRT show that the p-value is less than 0.05 (5% significance level), which rejects the null hypothesis. This means that there is a statistically significant difference between the proposed method and the other methods. That is, the performance of the proposed method is better than that of the other methods.
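The ranking idea behind the Friedman test can be illustrated with a short sketch that computes the mean rank of each method and the Friedman chi-square statistic, using the standard formula 12 / (n k (k + 1)) × Σ R_j² − 3 n (k + 1) for n blocks and k methods. The function name and the error values are hypothetical, ties are not handled, and the p-value (which would come from a chi-square distribution with k − 1 degrees of freedom) is omitted.

```python
def friedman_statistic(scores):
    """scores[i][j]: error of method j on block i (lower is better). Assumes no ties."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])  # best (smallest error) first
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    mean_ranks = [r / n for r in rank_sums]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return mean_ranks, chi2


# Hypothetical RMSEs of three methods on three data splits (not the paper's results).
scores = [
    [1.0, 2.0, 3.0],
    [1.1, 2.5, 2.0],
    [0.9, 1.5, 1.8],
]
mean_ranks, chi2 = friedman_statistic(scores)
```

In this toy case the first method is ranked best on every split, so its mean rank is 1.0; a lower mean rank corresponds to the "better ranking" mentioned in the text.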

Conclusion
In this paper, a hybrid ELM-PSO method is proposed to predict train delays. The ELM can overcome the shortcomings of the backpropagation training algorithm, and the advantage of PSO is its excellent ability to search for the best hyperparameter. Four benchmark models, namely CB, KNN, GBDT, and Lasso, are selected for comparison with the proposed model. These models were run on the same data collected from China Railways. ELM-PSO shows better performance and generalization ability (R-squared = 0.9955, MAE = 0.3490, RMSE = 1.0387) than the other models on the test set. Our work can not only give dispatchers sufficient time and decision support to make reasonable optimization and adjustment plans, but also has practical significance for improving the quality of railway service and helping passengers estimate their travel time.
The dataset used in this paper contains train delays under all types of scenarios. Therefore, in the future, we will consider dividing all the data into certain types of delay scenarios according to particular rules and applying currently prevalent models to train and predict each scenario to achieve higher accuracy. Finally, in terms of the input features, all the feature information in this paper can be obtained from train timetables. In the future, other types of features, such as infrastructure, weather features, and obstruction from other HSR lines, will be taken into account.
Figure 9: MAE, RMSE, and R-squared values on training and test sets with different data sizes (LASSO1 represents the performance on training set; LASSO2 represents the performance).

Data Availability
The data used to support the findings of this study were supplied by China Railway Guangzhou Bureau Group Co. Ltd. under license and so cannot be made freely available. Requests for access to these data will be considered by the corresponding author, with permission of China Railway Guangzhou Bureau Group Co. Ltd.