Modeling the Influence of Disturbances in High-Speed Railway Systems

Accurately forecasting the influence of disturbances in High-Speed Railways (HSR) has great significance for improving real-time train dispatching and operation management. In this paper, we show how to use historical train operation records to estimate the influence of high-speed train disturbances (HSTD), including the number of affected trains (NAT) and total delayed time (TDT), considering the timetable and disturbance characteristics.We first extracted data about the disturbances and their affected train groups from historical train operation records of Wuhan-Guangzhou (W-G) HSR in China. Then, in order to recognize the concatenations and differences of disturbances, we used a K-Means clustering algorithm to classify them into four categories. Next, parametric andnonparametric density estimation approacheswere applied to fit the distributions ofNATandTDTof each clustered category, and the goodness-of-fit testing results showed that Log-normal and Gamma distribution probability densities are the best functions to approximate the distribution of NAT andTDT of different disturbance clusters. Specifically, the validation results show that the proposed models accurately revealed the characteristics of HSTD and that these models can be used in real-time dispatch to predict the NAT and TDT, once the basic features of disturbances are known.


Introduction
An operating train may encounter various unexpected disturbances such as bad weather, power outage, facility failures, and so on [1], which can lead to considerable losses for both railway managers and travelers.For example, in the Dutch railway network, statistics show that there are approximately 22 infrastructure-related delays per day, lasting on average for 1.7 hours [2].According to the statistics of the China Railway Corporation, the average departure punctuality in origin stations for China's 23,000km high-speed rail (HSR) network was as high as 98.8% in 2016.However, due to various disturbances during the trip, the average punctuality at final destination stations was less than 90%, even though delays smaller than five minutes are considered punctual [3].
When disturbances occur, dispatchers need to anticipate the potential influences of a specific delay.They need to estimate the number of affected trains (NAT) and total delayed time (TDT) before rescheduling the timetable.Modeling the high-speed train disturbances (HSTD) will be helpful and of great significance, although it is extremely challenging due to the following two aspects: (1) Various influencing factors.The influence of railway disturbances is related to various factors, for example, timetable structure, facility conditions, and experience and preference of dispatchers which makes it difficult for these factors to be interpreted through functional relationships.
(2) Complex train interactions.Due to resources occupation conflicts and the continuity of train operation, trains are interactive, which makes mathematical model incapable of modeling train interaction.
In practice, some skilled dispatchers usually predict HSTD empirically, which leads to differences in dispatching even for the same dispatcher when working in different situations.Data-driven approaches have recently gained more attention due to their better understanding of train delay concatenation and the fact they are more supportive of robust timetables and real-time dispatching [4].In addition to the availability of train operation records, advanced data-mining techniques enable us to address these problems from a dataanalysis perspective.Train operation records are therefore assumed to be the interactive consequences of all influencing factors.Therefore, mining train operation records provides us with a brand-new way of examining train interactions arising from heterogeneous factors.
To bridge the gap between the empirical and mathematical models, this paper aims to establish data-driven models of the NAT and TDT caused by different types of disturbances using the train operation data of Wuhan-Guangzhou (W-G) HSR in China.To this end, we used a K-Means algorithm to categorize the extracted disturbances according to their influencing factors.Next, we applied five widely used probability density models and two kernel functions to fit the distributions of NAT and TDT.We then selected the best models for each cluster based on goodness-of-fit testing.Finally, the test data from the operation records from nine months were used to validate the generalization of the fitted models, which showed that these models could be applied to estimate the NAT and TDT of future disturbances.

Literature Review
Since the 1970s, there has been active research on disturbance management in train dispatching [5].Most recently, the topic of the INFORMS 2018 Railroad Problem Solving Competition was "Predicting Near-Term Train Schedule Performance and Delay" using operational records.A large number of methods and algorithms have been proposed to improve rail operations, but due to unavailable or insufficient historical data, research was mainly based on simulation and mathematical delay propagation models.
Many approaches have been uncovered and proposed to manage railway disturbances.Exogenous factors, such as natural disasters, and bad weather conditions, and endogenous factors, such as operation interference resulting from equipment failure, man-made faults, railway construction, temporary speed limitations, defective braking systems, signal and interlocking failures, and excessive passenger demand can contribute, alone or together, to the primary delay [6][7][8].Also, if the running and dwell times increase due to unexpected disturbances, it can result in knock-on delays and delays for other trains [9].Serious disruptions such as switch or signal failures, if not managed effectively, can result in queuing of trains creating a chain of delayed trains.The experience from the Taiwan High-speed Rail shows that shortening the maintenance cycle can effectively alleviate the problem of train delay caused by signal failures [10].
Some studies have made contributions on statistical models of delay and the respective fitness models.The Weibull, Gamma, and Log-normal distributions have been adopted in several studies [11,12].It was shown that the distributional form of primary delays, and the affected number of trains could be well-approximated by classical methods such as Log-normal distribution and linear regression models [13].A q-exponential function is used to demonstrate the distribution of train delays on the British railway network [14].Using spatial and temporal resolution transport data from the UK road and rail networks and the intense storms of 28 June 2012 as a case study, a novel exploration of the impacts of an extreme event has been carried out in [7].Given the HSR operation data, the maximum likelihood estimation method was used to determine the probability distribution of the different disturbances factors and the distributions of affected trains; however, the models of primary delay consequences have not been established in detail [15].Probabilistic distribution functions of both train arrival and departure delays at the individual station were derived in general based on the data from Beijing-Shanghai HSR [16].
Data-driven research studies proposed for delay/disruption management mainly focused on using regression or distribution approaches to fit delay data.Van der Meer et al. mined data from peak hours, including rolling stock, and weather data and developed a predictive model involving the mining of track occupation data for delay estimations [9].A data-mining approach was used for analyzing rail transport delay chains, with data from passenger train traffic on the Finnish rail network, but the data from the train running process was limited to one month [4].Murali et al. reported a delay regression-based estimation technique that models delay as a function of train mix and network topology [17].A statistical analysis of train delays in the Eindhoven Station in the Netherlands was used to explain systematic delay propagation based on the use of a robust linear regression model to uncover the correlations between arrival delays [18].Recently, Kecman and Goverde developed separate predictive models for the estimation of running and dwell times by collecting data on the respective process types from a training set [19].Javad et al. examined different distribution models for running times of individual sections in an HSR system and showed that the log-logistic probability density function is the best distributional form to approximate the empirical distribution of running times on the specified line [3].A hybrid Bayesian network model is also established to predict arrival and departure delays for Wuhan-Guangzhou HSR [20].
A review of the literature reveals that only a few studies focus on the modeling of the NAT and TDT of disturbances, especially in HSR operations.It is crucial for HSR operating companies to predict and reduce disruptions in train operations and to operate as closely as possible to published timetables.It is therefore important that they identify the severity of displacements or interruptions to train services.This can help dispatchers reduce delay propagation and the possibility of aggravation through effective designing of timetables or real-time dispatching decisions.This study aims to fill this gap by conducting a detailed factor-specific analysis of delays based on empirical data from W-G HSR.

Preliminaries
. .Data Description.The W-G HSR, which has a length of 1,096 km, is one of the busiest passenger railway lines in the country and joins with the Guangzhou-Shenzhen, Hengyang-Liuzhou, and Shanghai-Kunming HSRs at the GZS, HYE, and CSS stations, respectively.Trains operating on this line are all equipped with the Chinese Train Control System (CTCS), which allows a maximum speed of 350 km/h, and an Automatic Train Supervision system, which records the movements of all trains.The W-G HSR line, where only high-speed trains run on, is totally separated from conventional lines, but it connects with Shanghai-Kunming HSR line at CSS station, Hengyang-Liuzhou HSR line at HYE station, and Guangzhou-Shenzhen and Guiyang-Guangzhou HSR lines at GZN station.At these stations, train termination, turn-over, and cross-line are allowed.We studied the movement of trains in the northbound direction on this line, that is, from GZS to CSS, as shown in Figure 1.
The data collected include 57,796 trains in the GZS-HYE section and 64,547 HSR trains in the HYE-CSS section.The data contain operational records covering the period from March 24, 2015 to November 10, 2016, comprising scheduled/actual arrival/departure records for each train at each station, train numbers, dates, and information on occupied tracks.All the data with a time format was recorded in full minutes due to the accuracy of the system, as shown in Table 1.
In order to better understand the distributional pattern of the influence of disturbances, we first analyzed the train delay regularity by visualizing the arrival and departure delay distribution at HSW station.The histograms in Figure 2 clearly show that both arrival and departure delays follow a heavy-tailed distribution, from which we can infer that the influence of disturbance (NAT and TDT), like other problems in complex systems, has a right-skewed and heavy-tailed distribution.
. .Problem Description.In both practice and research, train delays and disturbances are always classified according to their causes.However, this method seems to have a drawback, as some disturbances with different causes sometimes have the same influencing mechanism on train operation.For example, track failure in sections and power supply fault in sections are different causes, but they have the same effect on train operation, because in both conditions, trains have to wait for the availability of the section.In other words, from the perspective of railway management, it is significant to classify disturbances according to their impacts on train operation.Besides, other cases like signal fault and turnout fault, speed limitation for bad weather and speed limitation for construction, they both have the same effect on train operation.In this research, we intend to classify the disturbances focusing on some key factors that influence their effects on train operation.

Clustering Model for Disturbances
. .Influencing Factors.In order to meet passenger demand, train services tend to vary across different periods and segments even on the same HSR line.Train interval is a key factor in delay propagation, as a disturbance tends to cause more severe effects in smaller interval periods and segments, and smaller effects in larger interval periods and segments.Figures 3(a) and 3(b) reveal the NAT and TDT of disturbances at GZN and HSW.It clearly shows that the NAT and TDT of W-G HSR differ significantly across time periods and segments.In addition, the disturbance length will directly influence their consequences; the longer the disturbance, the more delayed trains and total delays it will cause.Based on the analysis, the following features for clustering models and indexes to measure the influence of disturbances were ascertained: (i) Train interval (I): the average scheduled train interval when a disturbance occurs (minute); (ii) Occurrence time (T): the time when a disturbance starts (in the railway operation, it can be indicated by the scheduled arrival time of the first delayed train in its affected trains group); (iii) Disturbances length (L): the time span from starting time to ending time of the disturbance (it can be indicated by the delay times of the first train in the affected train group (minute).(iv) NAT and TDT indexes are used to measure their influence.
Figure 4, which shows a real disturbance in YDW-SG section on W-G HSR line, demonstrates the calculating methods of the selected features.If seven trains were delayed due to this disturbance (the first seven trains were delayed but the eighth train was on schedule when they arrived at  SG station), we thus define these seven trains as affected train group.In this case, NAT value is thus seven, TDT value is the sum of arrival delays of these seven trains at SG station, and the interval (I) is the average interval of these seven trains at SG station.The disturbance length (L) which is defined as the difference of its occurrence time and the time when this section is available in this research was calculated by the delay times of the first train in the affected train group.
Based on the indexes, data on 6006 disturbances and their consequences on train operations were extracted from the raw data; five cases are shown in Table 2.In order to validate the proposed model, the extracted data were split into a training dataset, which included 3154 disturbances in the preceding 12 months, and a validation dataset, which contained 2852 disturbances over the following 9 months.We extracted only those disturbances with a disturbance longer than 4 minutes according to the standard set by the Chinese Railway Company.Shorter disturbances, which can be assimilated by the time supplements distributed in timetables, tend not to cause delay propagation and are therefore eliminated from the dataset.

. . K-Means
where Equation (1) indicates the nearness between samples in a cluster and their mean vector.The core principle of the K-Means cluster is to choose K points in space as centers and assign the samples to their nearest cluster.By iteratively updating their centers, the objective is to minimize until the stopping condition is satisfied.The K-Means clustering algorithm can be concluded as shown in Algorithm 1 and more details are shown in literature [21].
. .Model Performance.The performance of clustering models is commonly evaluated from two perspectives: (1) the tightness of the samples in the cluster and (2) the distances between clusters.The widely used evaluation indexes are distance and covariance [22,23], both of which require smaller values among samples in the same cluster and larger values among samples in different clusters.To systematically where () is the average distance between   and other samples in a cluster; () is the average distance between   and samples in other clusters; notifying: the range of SC is where  is the number of samples, K is the number of clusters, and B K is the covariance matrix among samples in the same cluster, while W K is the covariance matrix among samples in different clusters, and tr(g) is the trace of the matrix.According to (2) and (3), better clustering results require larger SC and CHS.
To obtain reasonable clustering results, we also applied other clustering algorithms including the BIRTH, Spectral, and Agglomerative clustering models and investigated the number of clusters (K) of each model from two to twenty in order to categorize the disturbances.After standardization of the input data, clustering was performed, and the clustering result of each model is shown in Figure 5.We can first choose K-Means model as the best clustering model from the candidates, as it almost had the best performances on all different number of clusters on both SC and CHS metrics.An exception was when K equals two, which resulted in the highest SC index but the lowest CHS index of BIRCH model.Considering that clustering models should be evaluated from both distances-and covariance-based indexes to obtain systematical results, we thus chose K-Means model, which has both high SC and CHS metrics, as the clustering model for railway disturbances.Then, when we chose the K value, we found that three, four, and five clusters were all possible for K-Means algorithm.We, therefore, standardized the SC and CHS values of K-Means model between zero and one, and finally chose four as the best number of clusters for it resulted in the highest sum of standardized SC and CHS.Therefore, a K-Means algorithm that has four clusters was selected as the clustering model for HSR disturbances.
Finally, HSR disturbances were classified into four categories using the K-Means clustering algorithm, as shown in Figure 6.According to the distribution of each cluster, we can define them as follows: (i) Cluster A: disturbances occurred between 7:00 am and 23:30 pm; train intervals range from 3 to 13 minutes; lengths range from 5 to 14 minutes.(ii) Cluster B: disturbances occurred between 7:30 am and 23:30 pm; train intervals range from 5 to 25 minutes; lengths range from 14 to 31 minutes.(iii) Cluster C: disturbances occurred between 7:30 am and 23:30 pm; train intervals range from 4 to 30 minutes; lengths are longer than 31 minutes.(iv) Cluster D: disturbances occurred between 10:30 am and 23:30 pm; train intervals range from 13 to 30 minutes; lengths range from 5 to 22 minutes.
The statistics of NAT and TDT of each cluster are shown in Table 3.

Estimating Models of NAT and TDT
. .Candidate Models.In order to reveal the rules of disturbance influences, we first investigated the histogram 6:00 8:00 10:00 12:00 14:00 16:00 18:00 20:00 22:00 0:00  of NAT and TDT, as shown in Figures 7 and 8, which indicates that both NAT and TDT appear to have rightskewed distributions.Since the locations and shapes of the histograms are very different, we fitted the data using five common right-skewed distribution models and two kernel functions as the candidate models.The distribution models including the Log-normal, Weibull, Gamma, Exponential, and Logistic and kernel functions, including Gaussian and Epanechnikov kernels were employed to fit the data from parametric and nonparametric perspectives.The probability distribution models and kernel functions of these models are as follows [24].
where  is the shape parameter and  is the location parameter.
(ii) Weibull Distribution where >0 is the scale parameter and k>0 is the shape parameter.
(iii) Gamma Distribution where  is the shape parameter and  is the scale parameter.
(iv) Exponential Distribution where  is the shape parameter.
Where u is the location parameter and s is the scale parameter.
(vi) Gaussian Kernel The parameters were estimated using maximum likelihood and are shown in Table 4, and the fitting results of the distribution models are shown in Figures 7 and 8.These figures clearly show that all the candidate probability density models mimic the shape of the nonparametric estimation using Gaussian and Epanechnikov kernels, which enables us to choose the best model from the parametric candidates as the estimating model of disturbance influences.  . .Goodness-of-Fit Test.In this section, we evaluated the goodness-of-fit of the distribution models using the Kolmogorov-Smirnov (K-S) method [25] and selected the optimal distribution model for each category.
K-S testing is proposed to test whether a group of data follows a theoretical distribution model; its null hypothesis is that the dataset follows a theoretical distribution.Its testing statistics are the largest difference between the cumulative distribution function of the data and the theoretical distribution, as shown in (12).Because the number of trains is the integer, and the historical train operation data were recorded in the minimum unit of one minute, we inserted some random numbers that follow uniform distribution in order to meet the continuity requirement of the K-S method.

𝐷 = max
() −  ()      (12) where   () is the cumulative distribution function of the samples, which are the NAT and TDT; () is the cumulative distribution function of the theoretical distribution models, which are the five alternative distribution models.We chose the significance level  = 0.05; when the sample size is large enough (over 50), the critical value of D should be where n is the sample size.
According to the rules of the K-S test, if D<D 0.05 , the null hypothesis is accepted and the samples are considered as the following theoretical distribution.The smaller D is, the closer the sample is to the theoretical distribution.The K-S testing results of all the models are shown in Table 5.
Finally, we chose the model that passes the K-S test and has the smallest D as the distribution model of each disturbance cluster.The parameters of distribution models for NAT and TDT of each cluster are shown in Table 6.  . .Generalization Test.In order to investigate the generalization of the fitted distribution models, we used the disturbances in the following nine months of W-G HSR line to validate the models.We first fed the disturbances into the proposed K-Means clustering algorithm and obtained the clustering labels of each sample.The clusters that have the same labels as the training dataset are used to validate the fitted models.The fitting results are shown in Figures 9 and  10, and the descriptive statistics and their K-S testing results are shown in Table 7.
The testing results indicate that all the fitted models pass the goodness-of-fit and generalization testing.The fitted models can therefore accurately reveal the distributive disciplines of NAT and TDT and are of great significance to realtime train dispatching.

Conclusion
This paper presents a data-mining approach, which is composed of the K-Means clustering model and probability density fitting techniques to estimate NAT and TDT resulting from HSR operation disturbances, considering timetable and disturbance characteristics.The proposed clustering algorithm, which fully takes the disturbance characteristics and timetable structure into consideration, compensates the shortage of existing classification method of railway disturbance.In real-time operation, once a disturbance happens, its features are therefore known (including train interval, length and occurrence time of disturbance), which can be fed into the trained K-Means clustering algorithm, and its classifying label can be obtained.Then, according to the label, the specific distribution model can be chosen to calculate the probabilities of different outcomes.The probability models being fitted using real operation data can help dispatchers improve their decision-making qualities.With the probability distribution models, the dispatchers can obtain the realtime and future probabilities of NAT and TDT of any train operation disturbances.The influencing patterns of disturbances arising from the train operation data can be used to improve the rescheduling and adjusting abilities in disturbance situations and help dispatchers improve their decisionmaking qualities, by providing dispatchers with accurate estimation of NAT and TDT.Also, these probability models can improve the train operation and disturbance management in simulation systems, as they are more accurate than those hypothetical models that bring certain gaps between simulations and practice and usually over assume and ignore some situations and constraints of train operations.
The established distribution models are general models for W-G and X-S HSR lines.However, train services and infrastructure could both vary at different stations, leading to differences in distribution model parameters.Therefore, our future work will focus on establishing models for each station on the HSR network.

Figure 1 :
Figure 1: Map of W-G HSR line.

Figure 2 :Figure 3 :
Figure 2: Arrival and departure delay distribution of HSW station.

Figure 4 :
Figure 4: A real disturbance happened on W-G HSR line shown in time-space diagram (horizontal axis is time, and vertical axis is space).
Clustering.A K-Means cluster is a typical and popular algorithm that has strong robustness on highdimensional and multicollinear datasets in unsupervised learning.For the given dataset  = { 1 ,  2 , ⋅ ⋅ ⋅ ,   }, assuming that the clustering centers  = { 1 ,  2 , ⋅ ⋅ ⋅ ,   } are initialized, the object of K-Means is to minimize the mean squared error (MSE): min (  ∑ =1 ∑   ∈        −       2 ) CHS of clustering models

Figure 5 :
Figure 5: SC and CHS of clustering models.

Figure 6 :
Figure 6: Spatial distribution of each disturbance cluster.

Figure 9 :Figure 10 :
Figure 9: Fitting results of NAT on test data.

Table 1 :
Train operation data format in database.The passage tracks at the station are labeled with Roman characters, while the dwelling tracks are labeled with numbers.

Table 2 :
Examples of disturbances and their influences for modeling.

Table 3 :
Statistics of NAT and TDT.

Table 4 :
Fitted parameters of NAT and TDT.

Table 5 :
Goodness-of-fit testing of NAT and TDT using K-S method.
* " indicates the best models.

Table 6 :
Parameters of the best distribution models.

Table 7 :
Statistics and the testing result of NAT and TDT using K-S method.