Analyzing Capacity Utilization and Travel Patterns of Chinese High-Speed Trains: An Exploratory Data Mining Approach

Train capacity utilization (TCU), usually represented by the passenger load factor (PLF), is a critical measure of effectiveness for rail operation. In the literature, efforts are usually made to improve capacity utilization by optimizing rail operation and management strategies. Comparatively little attention is paid to analyzing the factors that affect TCU and to understanding the behavioral patterns behind it. This paper applies exploratory data mining techniques to three months of real-world train operation data from the Beijing-Shanghai High-Speed Railway. Principal component analysis (PCA) is conducted to find the principal components that can efficiently represent the collected data. Clustering techniques are then applied to understand the unique characteristics that affect PLF and the travel pattern. The findings can be further used to guide train operation planning and facilitate better decision-making.


Introduction
Due to the vast land area and enormous transportation demand in China, railway transportation plays an increasingly vital role in China's economy. In general, Chinese high-speed rail is preferred over other transportation modes, especially for long-distance trips. During the last five years, railway passenger volume in China has been increasing at a yearly growth rate of 10%. According to the 2016 statistics, the Chinese railway passenger volume was 2.8 billion, an increase of 11% compared to 2015. Despite the continuous growth of railway transportation in China, the train capacity of some passenger lines is underutilized, especially during off-peak seasons. For example, the average passenger load factor of high-speed trains in China is around 60-70%; in extreme cases, it is less than 40%. This has motivated transportation researchers to develop methods to reduce such capacity waste. Optimizing train capacity utilization (TCU) is challenging, mainly for two reasons: (i) passenger travel patterns are highly stochastic and unpredictable; (ii) many factors may influence TCU, and the causal relationships are hard to capture. To overcome these challenges, it has become imperative to identify the factors that affect TCU and to discover the behavioral patterns behind it.
Generally speaking, there are two approaches to understanding and improving train capacity utilization. One is the model-based approach, which applies analytical models to study the effects of train operation and management strategies (e.g., timetabling and ticketing) on train capacity utilization. The other is the data mining approach, which empirically analyzes TCU and the interrelationship between TCU and the influential factors.
Journal of Advanced Transportation
The model-based approach usually assumes that the causalities and quantitative relationships between rail passengers' choices and train operation/management factors are given. For example, pricing and ticketing are often considered the main management strategies that directly affect TCU. Accordingly, researchers have developed optimal pricing models for better train utilization and revenue generation. Zhang et al. [1] introduced a discriminative pricing method to improve TCU. You [2] formulated a constrained nonlinear integer programming model for railway seat allocation. Shibata et al. [3], Park et al. [4], and Bao [5] developed seat class assignment models to increase the utilization rate of intercity railways. Wang et al. [6] studied the seat allocation problem to optimize TCU, taking into account passengers' random choice behaviors. Another body of research aims to improve TCU by optimizing train operational factors such as train scheduling and timetabling. Zha [7], Lan [8], and Shi et al. [9] developed train operation optimization models to maximize train capacity utilization. Bussieck et al. [10] proposed a novel method to optimize the train operation plan by minimizing the number of transfer trips. Methods to improve TCU and revenue generation were also studied by Zhou et al. [11], Cadarso et al. [12], and Robenek et al. [13]. These studies usually assume that passenger volume and trip-making decisions are known and fixed. Such assumptions, albeit idealistic, are quite common in the literature, mainly due to the lack of real-world data (which is often the case for rail transportation studies in China).
In contrast to the first approach, the empirical approach applies data mining techniques for pattern recognition and knowledge discovery from real-world rail operational data. Although data mining approaches have been widely applied in many transportation applications (e.g., Zheng et al. [14]; Xie et al. [15]; Anand et al. [16]), such studies are rare in the field of railway transportation, mainly due to the lack of data. Only a handful of examples are found in the literature. For example, Xu et al. [17] used data mining techniques to analyze the time sequence and the spatial influence of trip making and presented a new approach for trip forecasting. Liu et al. [18] applied a fuzzy clustering model to analyze passengers' travel behaviors and key factors relevant to the level of service. Zheng et al. [14] used a data mining approach to analyze train passenger flow and developed a model to forecast passenger volume. To the authors' knowledge, no previous work has analyzed the influential factors of TCU.
The paper makes contributions in two aspects. (i) Exploratory data mining techniques are applied to a dataset that contains three months of real-world train operational data of the Beijing-Shanghai High-Speed Railway. Such information is usually held by railway companies and is not available to the general public or to academia. (ii) The unique characteristics that affect PLF and the underlying behavioral patterns are discovered and further analyzed.
The rest of the paper is organized as follows. In Section 2, we briefly describe the data source used in the study. Section 3 presents the key methodologies used for data mining and knowledge discovery from train operation data. The experiment and numerical results are presented in Section 4, followed by the concluding remarks in Section 5.

Data Description
The Railway Passenger Transport Management Information System is an official rail operation and management system maintained by China Railway Corporation (CRC). The dataset used for this study was retrieved from this system and contains three months of rail operation information for the Beijing-Shanghai High-Speed Railway. This railway line is the most important transportation corridor connecting the two largest cities of China. The line has a total length of 1318 km and passes through 24 stations. These 24 stations can be further categorized based on their administrative levels, as shown in Table 1. In general, a higher level indicates a larger population and higher socioeconomic status. The dataset was further processed to extract 33 representative operational features. Descriptions of the features can be found in Table 2.
The operational features include the passenger load factor (PLF), which directly indicates the capacity utilization of a train, date, ticketing strategy (TS), run duration (RDR), departure time (DT), train type (TT), number of stops (NS), run distance (RDI), stop schedule (SS), run speed (RS), and load coefficients (LCs) for all sections along the railway line. The authors are aware of other factors, such as trip purposes and passenger socioeconomic status, that could also affect TCU, but such information is not available from the CRC database. Since ticket prices remained stable during the study period, pricing is not considered an influential feature in the study.
In the literature, PLF is used to assess TCU, and load coefficients are used to assess sectional capacity utilization. In this study, both PLF and load coefficients are considered as important features. Let $C$ denote the train capacity (i.e., number of seats), $D$ the running distance, and $S$ the number of stations. PLF can be expressed as (1). A similar definition can be found in Bao et al. [19,20].
$$\mathrm{PLF} = \frac{\sum_{i=1}^{S-1} \sum_{j=i+1}^{S} q_{ij} d_{ij}}{C \cdot D} \tag{1}$$
Here $q_{ij}$ and $d_{ij}$ indicate the passenger OD volume and the section length between stations $i$ and $j$, respectively. Since passenger OD is not available from the dataset, equivalently, we can use the sectional passenger volumes ($V_i$) to calculate PLF, as in (2), where $d_i$ denotes the length of section $i$.
$$\mathrm{PLF} = \frac{\sum_{i=1}^{S-1} V_i d_i}{C \cdot D} \tag{2}$$
Note that the load coefficient of section $i$ is known as $l_i = V_i / C$ according to [21]. Therefore, we can derive the following relationship between PLF and the sectional load coefficients, as in (3).
$$\mathrm{PLF} = \frac{\sum_{i=1}^{S-1} l_i d_i}{D} \tag{3}$$
In Figure 1, we first show the aggregated statistics of the collected data. Figure 1(a) shows the average PLF over the study period together with its trend line. Figure 1(b) shows the average load coefficients of the upward and downward trains. It can be found that the average PLF decreases over the study period and that the travel pattern may be characterized by two trip segments, s1(BJS)-s12(XZE) and s12(XZE)-s24(SHHQ).
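The relationship between PLF, sectional passenger volumes, and sectional load coefficients can be illustrated with a minimal sketch. It assumes the usual reading of PLF as passenger-kilometres divided by seat-kilometres and a load coefficient equal to sectional volume over capacity; the function names and the three-section example run are hypothetical and are not taken from the CRC dataset.

```python
from typing import Sequence


def plf_from_volumes(volumes: Sequence[float], lengths: Sequence[float],
                     capacity: float) -> float:
    """Passenger-km divided by seat-km for one train run."""
    if len(volumes) != len(lengths):
        raise ValueError("one passenger volume per section is required")
    run_distance = sum(lengths)  # D: total running distance
    passenger_km = sum(v * d for v, d in zip(volumes, lengths))
    return passenger_km / (capacity * run_distance)


def plf_from_load_coefficients(load_coeffs: Sequence[float],
                               lengths: Sequence[float]) -> float:
    """Length-weighted mean of the sectional load coefficients."""
    if len(load_coeffs) != len(lengths):
        raise ValueError("one load coefficient per section is required")
    return sum(l * d for l, d in zip(load_coeffs, lengths)) / sum(lengths)


# Hypothetical run: three sections (km), sectional volumes, 600 seats.
lengths = [100.0, 200.0, 100.0]
volumes = [300.0, 450.0, 150.0]
capacity = 600.0
load_coeffs = [v / capacity for v in volumes]  # sectional volume over capacity

plf_a = plf_from_volumes(volumes, lengths, capacity)
plf_b = plf_from_load_coefficients(load_coeffs, lengths)
```

Both routes give the same value for the toy run, which is the point of the derivation: when OD volumes are unavailable, the sectional load coefficients and section lengths are sufficient to recover PLF.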

Methodology
In the context of statistical analysis and data mining, exploratory data analysis (EDA) is a process of detective work that does not require a predetermined hypothesis to be tested. Rather, the role of EDA is to explore data in as many ways as possible, until a plausible "story" of the data is unearthed. Formal definitions of EDA and exploratory data mining can be found in Tukey [22] and Yu [23]. In this section, exploratory data mining approaches are applied to gain insights into the structure of the data and the underlying travel patterns. First, principal component analysis (PCA) is used to select the most salient features (called principal components) to represent the train operation data. Second, we use clustering techniques to discover the intrinsic relationship between TCU and the principal components.

Principal Component Analysis.
PCA is a commonly used technique for dimensionality reduction and feature selection [24]. Here we use PCA to seek a low-rank approximation of the train operational data. In this step, the original 33 train operation features are transformed into a smaller set of new variables called principal components (PCs), which by design retain a similar amount of the variation present in the original dataset. PCs are uncorrelated variables, ordered from the largest variance to the smallest. The transformed variables $z_1, z_2, \ldots, z_p$ are referred to as PCs, and $\lambda_i$ denotes the variance of $z_i$. Further define the level of contribution of the first $r$ PCs as $\sum_{i=1}^{r} \lambda_i / \sum_{i=1}^{p} \lambda_i$, $r \le p$, which represents the percentage of variation explained by the selected PCs. Therefore we can get a reasonable representation of the original data (e.g., with an 80% level of contribution) with only a few PCs. Correlation analysis can then be conducted to examine the correlations between the PCs and the original features.
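As a small illustration of the level of contribution, the sketch below computes the cumulative share of variance explained by the leading PCs and the smallest number of PCs that reaches a target level. The PC variances are assumed to be already available (for example, as eigenvalues of the covariance matrix of the standardized features); the function names and the example variances are hypothetical.

```python
from typing import List, Sequence


def contribution_levels(variances: Sequence[float]) -> List[float]:
    """Cumulative share of total variance, PCs ordered by decreasing variance."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    levels, running = [], 0.0
    for variance in ordered:
        running += variance
        levels.append(running / total)
    return levels


def components_needed(variances: Sequence[float], target: float = 0.8) -> int:
    """Smallest number of leading PCs whose level of contribution reaches target."""
    for count, level in enumerate(contribution_levels(variances), start=1):
        if level >= target:
            return count
    return len(variances)


# Hypothetical PC variances, not values from the study.
example_variances = [6.0, 3.0, 1.0]
levels = contribution_levels(example_variances)
n_pcs = components_needed(example_variances, target=0.8)
```

In this toy case the first PC explains 60% of the variation and the first two explain 90%, so two PCs are needed to pass an 80% target.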

Clustering Analysis.
Fuzzy c-means clustering (FCM; see [25]) is then used to discover the interrelationship between the principal components (PCs) and the passenger load factor (PLF). The purpose of clustering is to put "similar" samples into the same group and to explore the patterns reflected by different groups. Let $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N\}$, $N = 133$, be the transformed train samples, each having $r$ features; i.e., $\tilde{x}_i = \{\tilde{x}_{i1}, \tilde{x}_{i2}, \ldots, \tilde{x}_{ir}\}$, $i = 1, 2, \ldots, N$. FCM is used to divide these samples into $K$ clusters; each cluster is characterized by its sample mean, called the centroid. The approach is a standard and widely used data mining approach and has proven effective for knowledge discovery from high-dimensional datasets [26]. FCM does not require each data point to belong to exactly one cluster; therefore it usually outperforms hard clustering methods (e.g., K-means) on overlapping datasets. The objective of FCM is to minimize the sum of weighted distances between each sample and the centroid of each cluster, as in formulation (6), i.e., to minimize the differences among the samples within the same cluster.
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{m} \lVert \tilde{x}_i - c_j \rVert^2 \tag{6}$$
Here $m \in [1, \infty)$ is the fuzzy factor that determines the fuzzy weight of the clustering results; $u_{ij}$ is the degree of membership of $\tilde{x}_i$ in cluster $j$; and $c_j$ is the centroid of cluster $j$ in the $r$-dimensional feature space. Note that the distance between each sample and each cluster centroid is measured by the Euclidean norm as in (7), where $\tilde{x}_{ik}$ represents the $k$-th feature of the $i$-th transformed sample and $c_{jk}$ denotes the location of centroid $j$ at the $k$-th dimension.
$$\lVert \tilde{x}_i - c_j \rVert = \sqrt{\sum_{k=1}^{r} \left( \tilde{x}_{ik} - c_{jk} \right)^2} \tag{7}$$
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown in (6), with the updated degree of membership calculated using
$$u_{ij} = \frac{1}{\sum_{l=1}^{K} \left( \dfrac{\lVert \tilde{x}_i - c_j \rVert}{\lVert \tilde{x}_i - c_l \rVert} \right)^{2/(m-1)}} \tag{8}$$
and the cluster centroid updated using
$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \tilde{x}_i}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{9}$$
The iterative algorithm terminates when $\lVert C^{(t+1)} - C^{(t)} \rVert \le \varepsilon$, where $\varepsilon$ is a stop criterion and $C^{(t)}$ is the $K \times r$ cluster centroid matrix at iteration $t$. This procedure converges at least to a local minimum point of $J_m$. It is noteworthy that the aforementioned procedure does not specify the number of clusters; the optimal number of clusters is determined based on the Xie-Beni coefficient [27] and the Separation coefficient [28] in the experiment.
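The iterative procedure described above can be sketched in plain Python as follows. This is a minimal illustration of standard fuzzy c-means with a Xie-Beni style validity index, not the authors' implementation: the random initialization of memberships, the default fuzzy factor of 2, the distance floor that avoids division by zero, and the toy samples are all assumptions made for the example.

```python
import math
import random


def fuzzy_c_means(samples, n_clusters, m=2.0, eps=1e-6, max_iter=300, seed=0):
    """Return (centroids, memberships); assumes m > 1 and numeric samples."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([value / total for value in row])

    centroids = None
    for _ in range(max_iter):
        # Centroid update: mean of the samples weighted by u_ij ** m.
        new_centroids = []
        for j in range(n_clusters):
            weights = [u[i][j] ** m for i in range(n)]
            weight_sum = sum(weights)
            new_centroids.append(
                [sum(w * s[k] for w, s in zip(weights, samples)) / weight_sum
                 for k in range(dim)]
            )
        # Membership update from Euclidean distances to the new centroids.
        for i in range(n):
            dists = [max(math.dist(samples[i], c), 1e-12) for c in new_centroids]
            for j in range(n_clusters):
                u[i][j] = 1.0 / sum(
                    (dists[j] / d) ** (2.0 / (m - 1.0)) for d in dists
                )
        # Stop once the centroid matrix has moved by at most eps.
        shift = float("inf") if centroids is None else math.sqrt(sum(
            (a - b) ** 2
            for new, old in zip(new_centroids, centroids)
            for a, b in zip(new, old)
        ))
        centroids = new_centroids
        if shift <= eps:
            break
    return centroids, u


def xie_beni(samples, centroids, u, m=2.0):
    """Compactness over separation; lower values suggest a better partition."""
    compactness = sum(
        u[i][j] ** m * math.dist(x, c) ** 2
        for i, x in enumerate(samples)
        for j, c in enumerate(centroids)
    )
    separation = min(
        math.dist(a, b) ** 2
        for idx, a in enumerate(centroids)
        for b in centroids[idx + 1:]
    )
    return compactness / (len(samples) * separation)


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, memberships = fuzzy_c_means(toy, n_clusters=2)
score = xie_beni(toy, centroids, memberships)
```

One way to use such an index is to rerun the clustering for several candidate numbers of clusters and keep the value with the lowest score. The sketch covers only the Xie-Beni index, not the Separation coefficient.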

Experiment and Numerical Results
We first separate the samples into downward trains and upward trains. PCA and clustering techniques are then applied to these two datasets. A few interesting findings are generated from the exploratory data analysis and they are discussed in this section.

Downward Trains.
The downward trains are those traveling from Beijing South (s1) to Shanghai Hongqiao (s24). PCA was first applied to the dataset. The cumulative level of contribution (with respect to PCs) is shown in Figure 2(a). It is found that PC1-PC3 account for more than 80% of the total variation. In Figure 2(b), it is shown that PC1 is strongly correlated (degree of correlation > 0.6) with a few features, including run duration (RDR), run distance (RDI), stop scheme (SS), and the sectional load coefficients $l_7 \sim l_{23}$, from Jinan West (JNW) station to Shanghai Hongqiao (SHHQ) station. Some other features such as Date and Run Speed (RS) are not strongly correlated with PC1. This indicates that PC1 and the strongly correlated features account for the highest variation in the data.
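The screening of features against a PC can be sketched as below. The text does not state which correlation measure was used, so the sketch assumes the Pearson coefficient and compares its absolute value with the 0.6 threshold. The feature values and PC1 scores are hypothetical and only illustrate the mechanics.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def strongly_correlated(features: Dict[str, Sequence[float]],
                        pc_scores: Sequence[float],
                        threshold: float = 0.6) -> Dict[str, float]:
    """Features whose absolute correlation with the PC exceeds the threshold."""
    correlations = {name: pearson(column, pc_scores)
                    for name, column in features.items()}
    return {name: r for name, r in correlations.items() if abs(r) > threshold}


# Hypothetical PC1 scores and feature columns for five trains.
pc1_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
features = {
    "RDI": [200.0, 400.0, 600.0, 800.0, 1000.0],  # run distance (km)
    "RS": [280.0, 300.0, 290.0, 300.0, 280.0],    # run speed (km/h)
}
strong = strongly_correlated(features, pc1_scores)
```

In this toy case only the run distance passes the threshold, mirroring the kind of contrast reported between RDI and RS for PC1.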
In the following experiment, we use PC1 and PLF for fuzzy c-means clustering. Two optimal clusters are found, which are plotted in Figure 3(a). It can be observed that higher PLF is associated with higher PC1. Since PC1 is positively correlated with RDR, RDI, SS, and $l_7 \sim l_{23}$, it can be further inferred that longer run distance/travel time, higher level of stop scheme (i.e., fewer stops), and higher sectional load coefficients from JNW station to SZN station are associated with trains of higher PLF. Such inference can be validated by plotting the distributions of these original features for each cluster, as shown in Figures 3(b), 3(c), 3(d), and 4.
It is also noticed that cluster B in Figure 3(a) shows multifurcated lines with different slopes, representing different ratios of PLF to PC1. To further analyze the pattern, we used RDI as a surrogate of PC1 and applied the clustering model using PLF/RDI as the only feature. The results in Figure 5 show five clusters, which correspond to the five lines shown in Figure 3(a). The results imply that the marginal effect of RDI gradually decreases; i.e., changing short-distance trains to medium-distance trains seems to be more beneficial (in terms of the gain in PLF) than changing medium-distance trains to long-distance trains. This finding can be used to guide train scheduling.

Upward Trains.
The cumulative level of contribution of each PC is shown in Figure 6(a) for the upward trains from Shanghai Hongqiao (s24) to Beijing South (s1). We then conducted clustering analysis using PLF, PC1, and PC2. It is found that the optimal number of clusters is 3, as shown in Figure 6(b). Figure 7 shows the original features that are strongly correlated with PC1 and PC2. In particular, it is found that departure time (DT) is strongly correlated with PC2. For the upward trains, a few findings on TCU and passenger travel patterns can be put forward. As observed in Figure 6(b), compared to the samples with lower PLF (cluster B), trains with higher PLF (cluster A) are associated with larger PC1, indicating that higher SS (fewer stops), longer RDI, and higher LCs are associated with better train capacity utilization. This is further verified in Figures 8(a), 8(b), and 8(c). This finding is consistent with that for the downward trains.
The result in Figure 6(b) also shows that cluster C, separated from the other two clusters, has large variation in the dimension of PC2. By further analyzing the distributions of DT (a surrogate of PC2), it is found that cluster C is associated with the samples that have early departure times (as shown in Figure 8(d)) and go through fewer sections/shorter distances (as shown in Figures 8(b) and 8(c)). These samples correspond to the extra (temporary) short-distance trains that depart in the early morning. We then reran the PCA and clustering models only for cluster C samples to further explore the patterns of these extra trains. The results are shown in Figure 9.
It is shown that PLF and LCs ($l_6 \sim l_1$) are strongly correlated with PC1; date, DT, and LCs ($l_{16} \sim l_7$) are strongly correlated with PC2. As in Figure 9(a), cluster C-2 and cluster C-3 are in the higher region of PC1; cluster C-1 is in the lower region of PC1. It is found that dates early in the quarter and early DT are associated with higher PLF and greater LCs ($l_6 \sim l_1$), as illustrated by cluster C-2; dates late in the quarter and relatively late DT are also associated with higher PLF and greater LCs. It is noteworthy that the early part of the quarter corresponds to the "Golden Week" (a Chinese national holiday) and the late part of the quarter is close to the New Year. Therefore, the extra trains with early or late departure times are better utilized in the holiday seasons than in other seasons.
By scrutinizing Figure 8(c), it is found that the major trip attraction for cluster A trains is Beijing (as the load coefficient is high at section 1), and the major trip attraction for cluster B trains is the city of Xuzhou (XZE station), a medium-level city. Combining the patterns in Figures 8(a) and 8(c), it can be concluded that passengers traveling to Beijing prefer to choose the trains with fewer stops, most likely due to their higher value of time.

Concluding Remarks
This paper proposes an exploratory data mining approach to discover the influential features of TCU and understand the travel patterns using real world train operational data. Several interesting findings were reported in the paper, as summarized below.
(1) Run distance and stop scheme are found to be closely related to TCU. Per the specific dataset, trains with longer run distance and fewer stops result in higher TCU.
(2) The marginal effect of travel distance decreases in terms of the gain in TCU. Making the short-distance trains into medium-distance trains is more beneficial compared to making medium-distance trains into long-distance trains.
(3) The extra (temporary) trains are better utilized during the holiday seasons, and the extra trains in off-peak seasons are not as well utilized.
(4) Passengers to major cities prefer trains with fewer stops. Such behavioral pattern can be explained by their value of time.
These findings, albeit case-specific, have shown that the proposed approach is a useful tool for data mining and knowledge discovery from train operational data and it can be utilized to facilitate smarter decision-making for train operation and management.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.