Research on Rapid Congestion Identification Method Based on TSNE-FCM and LightGBM

Deng, Cheng; Zhang, Qiqian; Zhang, Honghai; Li, Jingyu; Ning, Changyuan

doi:10.3390/su151411322


College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China


Author to whom correspondence should be addressed.

Sustainability 2023, 15(14), 11322; https://doi.org/10.3390/su151411322

Submission received: 6 June 2023 / Revised: 17 July 2023 / Accepted: 18 July 2023 / Published: 20 July 2023

(This article belongs to the Special Issue Sustainable Development of Airspace Systems)


Abstract:

The terminal area is a convergence point for inbound and outbound traffic, and it is characterized by a complex airspace structure and high traffic density. It is an area that frequently experiences flight congestion and ground delays. A system capable of the intelligent, reliable, timely, and accurate identification of air traffic congestion for air–ground coupled flight flow constitutes a key technology with respect to unlocking the potential capacity of the terminal area, mitigating traffic congestion, and assisting air-traffic-control-related decision making. Therefore, this article aims to extract and analyze the multi-scale and multi-dimensional evaluation indicators of air–ground coupled flight flow congestion, use the TSNE-FCM algorithm to classify congestion levels, and, based on this work, construct a real-time and fast congestion identification model using the LightGBM algorithm. The case study analyzed Guangzhou Baiyun Airport (CAN), and the experimental results indicate the following: (1) The congestion level classification achieved using the TSNE-FCM algorithm is superior to that achieved using the FCM algorithm. Furthermore, flight delays predominantly occur in slightly congested and congested states. (2) The congestion identification model based on LightGBM outperforms the XGBoost, RandomForest, and ExtraTrees models. The macro-average and micro-average AUC values for the LightGBM model were both 0.96. The LightGBM model demonstrates excellent performance and is suitable for identifying congestion levels in practical engineering applications.

Keywords:

TSNE-FCM; LightGBM; congestion identification; terminal area; air–ground coupled flight flow

1. Introduction

In the aftermath of the COVID-19 pandemic, the civil aviation transportation industry has encountered new opportunities for development, showing a trend of recovery and accelerated growth. However, airspace congestion has consistently been a bottleneck issue that hinders the development of China’s civil aviation transportation industry [1,2]. The terminal area is an intertwined region that is vertically coupled with airport surfaces and intricately connected to the air route network. Due to the convergence of traffic for arrivals and departures, the terminal area is characterized by a complex airspace structure and high traffic density. With regard to air–ground coupled flight flow in the terminal area, a system allowing for the intelligent, reliable, timely, and accurate identification of air traffic congestion constitutes a core technology with respect to unlocking potential capacity, alleviating traffic congestion, and assisting air-traffic-control-related decision making.

Therefore, the purpose of this study is to extract and analyze the attributes reflecting the nature of congestion regarding air–ground coupled flight flow in terminals. Based on this, this article aims to deeply integrate artificial intelligence, big data mining, and the civil aviation industry to research rapid identification methods for congestion situations and provide a basis from which more accurate and scientific decisions can be made with respect to air traffic control. Both domestic and international scholars have focused on three main aspects when researching the identification of air traffic congestion: threshold determination, clustering analysis, and machine learning.

1.1. Threshold Determination

Regarding threshold determination, researchers have drawn inspiration from methods used for identifying ground traffic congestion. They compare the actual traffic conditions in the airspace, such as the number of aircraft, aircraft spacing, and traffic demand, with monitoring thresholds known as Monitor Alert Parameters (MAP) to identify the nature of air traffic congestion. For example, in the United States, the Enhanced Traffic Management System (ETMS) is employed, which compares the traffic demand in different airspace units (terminal area sectors, critical points in the air route network, and airport surfaces) with MAP. When the actual demand exceeds the threshold, the airspace unit is considered congested [3]. Sun D, on the other hand, utilized modeling techniques and the mining of historical operational data to determine sector traffic demand and compared this demand with MAP to analyze the corresponding congestion situation [4]. However, this method has a certain degree of subjectivity. Unlike ground traffic, the air traffic of coupled flight flow is relatively complex. Even when the air traffic demand is below the threshold, the aircraft within the airspace unit may not have achieved a stable and orderly flow [5]. This leads to increased system entropy, enhanced disorderliness, higher complexity, and increased workload for air traffic controllers. The system is also susceptible to external disturbances, thereby increasing the likelihood of congestion. Additionally, this method overlooks important information embedded in historical trajectory data and the influence of uncertainty factors such as weather conditions and air traffic controller intent. Therefore, this study employs an approach combining clustering analysis and machine learning to achieve rapid and real-time identification of congestion situations.

1.2. Clustering Analysis and Machine Learning

Clustering analysis introduced a new research direction with respect to identifying traffic congestion situations. Jingqin Jiang, for instance, constructed congestion feature evaluation indicators from among multiple dimensions of “point, line, and volume”, and selected indicators through correlation analysis. The Fuzzy C-Means (FCM) algorithm was then utilized to identify congestion situations in the airspace of a terminal [6]. Guiyi Li combined machine learning with threshold determination and generated congestion identification rules for the terminal area using FCM-rough set [7]. Jinglin Dong applied the entropy method to weight the indicators and used the FCM algorithm to recognize congestion situations in the terminal area [8]. However, the real-time and rapid identification of air traffic congestion situations cannot be achieved using a single clustering algorithm. Therefore, a few scholars have employed clustering analysis to classify congestion levels and utilized machine learning algorithms to develop congestion identification models. These models take inputs in the form of evaluation indicators and temporal sequences of congestion levels, thereby enabling real-time and rapid identification of congestion situations. Guiyi Li established an evaluation indicator system for congestion situations, dividing the congestion levels using Fuzzy C-Means; subsequently, the cited author trained a congestion identification model for flight segments using ensemble learning methods [9]. Zheng Zhao also constructed an identification and prediction model of an air traffic network based on FCM and LightGBM [10].

1.3. Research Review Summary

The aforementioned achievements have laid a solid foundation for traffic congestion identification. Regarding the identification of traffic congestion situations based on clustering algorithms and machine learning, the effectiveness of clustering in dividing congestion levels directly affects the scientific validity and feasibility of the identification models. The existing research usually applies direct clustering or weighted clustering to high-dimensional feature vectors for the classification of congestion levels. However, it has been shown that traditional clustering algorithms exhibit significant performance and efficiency degradation when addressing high-dimensional spaces, commonly referred to as the “curse of dimensionality” phenomenon [11,12,13]. Therefore, directly applying clustering algorithms to high-dimensional feature vectors may jeopardize the reliability and scientific validity of classification results regarding congestion levels. In other engineering fields, many scholars have studied different topics by combining clustering algorithms with dimensionality reduction algorithms and achieved better results [14,15].

Therefore, to address the aforementioned issue, this study combines T-distributed Stochastic Neighbor Embedding (TSNE) with the Fuzzy C-Means (FCM) algorithm. The experimental results demonstrate that TSNE-FCM improves the quality of congestion level classification. This approach will lay a solid foundation for training real-time and rapid congestion identification models.

2. Methods

The overall approach of this study is as follows (a corresponding flowchart is shown in Figure 1):

1. Select the terminal area and determine the longitude, latitude, and altitude parameters. Collect ADS-B data.
2. Establish a multidimensional and multilevel congestion evaluation indicator system tailored to the air–ground coupled flight flow. Compute and construct a historical time series matrix of congestion evaluation indicators in high-dimensional space. Conduct a correlation analysis to identify indicators with low relevance.
3. Utilize the TSNE-FCM algorithm to classify congestion levels and generate a historical time series matrix of congestion levels in the terminal area.
4. Utilize the historical time series matrix of the congestion evaluation indicators in high-dimensional space and the historical time series matrix of congestion levels in the terminal area as inputs. Train and evaluate a congestion perception model based on LightGBM. This model will enable real-time and rapid identification of congestion situations of the air–ground coupled flight flow in the terminal area.

Figure 1. Flowchart of overall approach.

2.1. Definitions and Data

2.1.1. Concept Definitions

The air–ground coupled flight flow in the terminal area refers to the dynamic interaction and coordinated operation of aircraft involved in taxiing, takeoff, landing, departure, and approach within a specific spatial and temporal domain.

Congestion can be defined as the occurrence of aircraft clustering and delays within a specific spatial and temporal domain of the terminal area due to the conflicting demands of air traffic and traffic capacity as well as various other factors such as weather conditions. Congestion is characterized by aircraft engaging in a significant degree of following behavior, widespread accumulation and delays, and significant competition for airspace and human resources. There are four congestion levels in this paper: smooth state, normal state, slightly congested state, and congested state.

2.1.2. Index Definitions

Terminal area flow refers to the number of aircraft engaged in flight operations, engaging in climb-out upon departure, or approaching descent within a given unit of time in the terminal area, as expressed in Equation (1). Flow exhibits distinct spatiotemporal characteristics and can dynamically describe the operational features of air–ground coupled flight flows, thus reflecting the level of congestion in the terminal area to some extent:

$Q_{t} = \mathrm{count}\,\{\mathit{Flight}\}_{t}$

(1)

where $Q_{t}$ is the flow in the terminal area, and $\{\mathit{Flight}\}_{t}$ is the aircraft set engaged in flight operations within a given unit of time.

Terminal retention refers to the difference between the inflow and outflow of traffic in the terminal area within a given unit of time divided by the outflow of traffic from the terminal area (as shown in Equation (2)). This metric serves as an indicator of the traffic service level within the terminal area and can provide insights into the dynamic process of congestion formation or dissipation.

$R_{t} = \frac{q_{t}^{in} - q_{t}^{out}}{q_{t}^{out}}$

(2)

In the equation above, $R_{t}$ denotes terminal retention, $q_{t}^{in}$ is the traffic flow into the terminal area, and $q_{t}^{out}$ is the traffic flow out of the terminal area within a given unit of time.

The average departure delay time (unit: seconds) is the average difference between the actual (or estimated) departure time and the scheduled departure time of aircraft operating in the airport movement area within a given unit of time, as shown in Equation (3):

$D_{t} = \frac{\sum_{i=1}^{d_{t}} (A_{t}^{i} - S_{t}^{i})}{d_{t}}, \quad A_{t}^{i} > S_{t}^{i}$

(3)

where $D_{t}$ is the average departure delay time; $A_{t}^{i}$ is the actual (or estimated) departure time of the departing aircraft $i$; $S_{t}^{i}$ is the scheduled departure time of the departing aircraft $i$; and $d_{t}$ is the number of aircraft delayed from departure within a given unit of time.

The average horizontal flight speed (unit: kilometers per hour) is the average speed of aircraft in the terminal area within a given unit of time, as expressed in Equation (4):

$\bar{V_{t}} = \frac{\sum_{i=1}^{Q_{t}} V_{t}^{i}}{Q_{t}}$

(4)

where $\bar{V_{t}}$ is the average horizontal flight speed and $V_{t}^{i}$ is the horizontal flight speed of the aircraft $i$ within a given unit of time.

The average flight time (unit: seconds) is the average duration of flight for aircraft in the terminal area within a given unit of time, as shown in Equation (5):

$\bar{T_{t}} = \frac{\sum_{i=1}^{Q_{t}} T_{t}^{i}}{Q_{t}}$

(5)

where $\bar{T_{t}}$ is the average flight time and $T_{t}^{i}$ is the flight time of the aircraft $i$ within a given unit of time.

The average vertical flight speed (unit: inches per minute) is the average rate of descent for aircraft in the approach phase within a given unit of time, as shown in Equation (6):

$\bar{S_{t}} = \frac{\sum_{i=1}^{n_{t}} S_{t}^{i}}{n_{t}}$

(6)

where $\bar{S_{t}}$ is the average vertical flight speed; $S_{t}^{i}$ is the vertical flight speed of an aircraft $i$ within a given unit of time; and $n_{t}$ is the number of aircraft in the approach phase within a given unit of time.
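The indicator definitions in Equations (1)–(6) can be sketched as a few small functions. This is only an illustration: the function names are hypothetical, the inputs are assumed to be already filtered to one 15 min time slot, and returning 0.0 when no departure is delayed is an assumption that the text does not specify.

```python
from typing import Iterable, Sequence


def terminal_flow(flight_ids: Iterable[str]) -> int:
    """Eq. (1): number of distinct aircraft observed in the time slot."""
    return len(set(flight_ids))


def terminal_retention(q_in: float, q_out: float) -> float:
    """Eq. (2): (inflow - outflow) / outflow."""
    if q_out == 0:
        raise ValueError("Eq. (2) is undefined when the outflow is zero")
    return (q_in - q_out) / q_out


def average_departure_delay(actual: Sequence[float],
                            scheduled: Sequence[float]) -> float:
    """Eq. (3): mean delay in seconds over delayed departures only (A > S)."""
    delays = [a - s for a, s in zip(actual, scheduled) if a > s]
    # Assumption: report 0.0 when no departure in the slot is delayed.
    return sum(delays) / len(delays) if delays else 0.0


def slot_mean(values: Sequence[float]) -> float:
    """Eqs. (4)-(6): arithmetic mean over the aircraft in the slot."""
    return sum(values) / len(values)
```

For example, an inflow of 12 and an outflow of 10 aircraft give a retention of 0.2, which indicates that traffic is accumulating in the terminal area during that slot.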

2.1.3. Data Collection

This study focuses on the terminal area of Guangzhou Baiyun Airport (CAN), for which actual operational data were collected and organized from 1 July 2019 to 28 July 2019, with a time interval of 15 min. A case analysis was conducted to validate the effectiveness of the proposed algorithm. Guangzhou Baiyun Airport is one of the largest aviation hubs in southern China. In July 2019, the airport handled a total of 41,817 inbound and outbound flights, constituting 20,899 inbound flights and 20,918 outbound flights. The collected data include flight number, airport of departure, destination airport, scheduled departure time, horizontal flight speed, vertical flight speed, longitude, latitude, heading, and timestamp.

2.2. Congestion Level Classification Model Based on TSNE-FCM

To address the issue mentioned in the introduction, this article combines the TSNE algorithm with the FCM algorithm, resulting in the TSNE-FCM algorithm. The algorithmic process is shown in Figure 2.

2.2.1. TSNE

The TSNE (T-Distributed Stochastic Neighbor Embedding) algorithm employs a nonlinear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space, typically of two or three dimensions, while preserving the information carried by the high-dimensional data [16,17,18]. Its main function is to convert the similarity between data points in the high-dimensional space into a Gaussian joint probability distribution and convert the similarity between data points in the low-dimensional space into a Student’s t-distribution with one degree of freedom. The algorithm was designed to minimize the Kullback–Leibler (KL) divergence between these two distributions by iteratively updating the embeddings using gradient descent [19]. Through this iterative process, the algorithm produces a lower-dimensional representation of the data. The steps of the process are as follows:

1. Initialize the high-dimensional congestion evaluation indicator matrix, denoted as $X$, using Equation (7) and the low-dimensional mapping evaluation matrix, denoted as $Y$, using Equation (8);

$X = (Q, \bar{V}, \bar{T}, \bar{S}, D, R)$

(7)

$Y = (y_{1}, y_{2})$

(8)

2. Calculate the similarity between each pair of sample points in the high-dimensional indicator matrix $X$ and transform it into a Gaussian joint distribution probability, denoted as $p_{ij}$, using Equations (9) and (10), where $n$ is the number of sample points and $\sigma_{i}$ is the bandwidth of the Gaussian centered on $x_{i}$;

$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}$

(9)

$p_{j \mid i} = \frac{\exp(-\|x_{i} - x_{j}\|^{2} / 2\sigma_{i}^{2})}{\sum_{k \neq i} \exp(-\|x_{i} - x_{k}\|^{2} / 2\sigma_{i}^{2})}$

(10)

3. Calculate the similarity between each pair of sample points in the low-dimensional mapping evaluation matrix $Y$ and transform it into a Student’s t-distribution probability distribution with one degree of freedom, denoted as $q_{ij}$, using Equation (11);

$q_{ij} = \frac{(1 + \|y_{i} - y_{j}\|^{2})^{-1}}{\sum_{k \neq l} (1 + \|y_{k} - y_{l}\|^{2})^{-1}}$

(11)

4. Calculate the Kullback–Leibler (KL) divergence between the joint probability distribution $P$ of the high-dimensional indicator matrix $X$ and the joint probability distribution $Q$ of the low-dimensional mapping evaluation matrix $Y$, as shown in Equation (12), and compute the gradients, as shown in Equation (13);

$\min C = KL(P \| Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

(12)

$\frac{\partial C}{\partial y_{i}} = 4 \sum_{j} (p_{ij} - q_{ij}) (y_{i} - y_{j}) (1 + \|y_{i} - y_{j}\|^{2})^{-1}$

(13)

5. Update the low-dimensional mapping evaluation matrix $Y$ based on the gradient information, as shown in Equation (14), where $\eta$ is the learning rate and $\alpha(t)$ is the momentum at iteration $t$;

$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t) (Y^{(t-1)} - Y^{(t-2)})$

(14)

6. Repeat steps 2–5 until the specified number of iterations is reached or the convergence condition is met;

7. Output the final low-dimensional mapping evaluation matrix $Y$, which represents the dimensionality-reduced congestion evaluation indicator matrix.
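The steps above can be sketched in NumPy as follows. This is a simplified illustration rather than the implementation used in the study: it assumes a single fixed bandwidth `sigma` for all points (practical TSNE implementations tune one bandwidth per point from a perplexity setting), a constant momentum `alpha`, and hypothetical function names. The update moves against the gradient so that the KL divergence is reduced.

```python
import numpy as np


def joint_p(X, sigma=1.0):
    """Eqs. (9)-(10): symmetric joint probabilities from Gaussian conditionals."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(K, 0.0)                 # a point is not its own neighbor
    cond = K / K.sum(axis=1, keepdims=True)  # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)


def joint_q(Y):
    """Eq. (11): Student-t similarities (one degree of freedom) in the embedding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W


def kl_divergence(P, Q):
    """Eq. (12), skipping the zero entries of P (the diagonal)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


def kl_gradient(P, Y):
    """Eq. (13): gradient of the KL divergence with respect to each y_i."""
    Q, W = joint_q(Y)
    A = (P - Q) * W                          # (p_ij - q_ij)(1 + d_ij^2)^-1
    return 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)


def tsne_embed(X, n_iter=300, eta=50.0, alpha=0.5, seed=0):
    """Steps 1-7: gradient descent with momentum on a 2-D embedding."""
    rng = np.random.default_rng(seed)
    P = joint_p(X)
    Y = 1e-2 * rng.standard_normal((X.shape[0], 2))
    step = np.zeros_like(Y)
    for _ in range(n_iter):
        step = alpha * step - eta * kl_gradient(P, Y)  # descent direction
        Y = Y + step
    return Y, kl_divergence(P, joint_q(Y)[0])
```

Calling `tsne_embed(X)` on a standardized indicator matrix with one row per time slot returns the two-dimensional embedding together with its final KL divergence.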

2.2.2. FCM

The Fuzzy C-Means algorithm is a fuzzy clustering algorithm based on an objective function [20]. In this study, it is applied to the low-dimensional mapping evaluation matrix produced by the TSNE algorithm in order to classify congestion levels. The steps of this process are as follows:

1. Initialize the number of clustering categories, which corresponds to the number of congestion levels, the stopping threshold, the iteration count, the weighting exponent, the membership matrix, and the cluster centers.

2. Iterate with the goal of minimizing the value function, as shown in Equation (15):

$\min J(U, V) = \sum_{i=1}^{m} \sum_{j=1}^{c} (u_{ij})^{k} \|y_{i} - v_{j}\|^{2} \quad \text{s.t.} \quad \begin{cases} \sum_{j=1}^{c} u_{ij} = 1, & 1 \leq i \leq m \\ 0 \leq u_{ij} \leq 1, & 1 \leq j \leq c,\ 1 \leq i \leq m \\ 0 < \sum_{i=1}^{m} u_{ij} < m, & 1 \leq j \leq c \end{cases}$

(15)

Update the membership matrix, as shown in Equation (16), and the cluster centers, as shown in Equation (17):

$u_{ij} = \left( \sum_{l=1}^{c} \left( \frac{\|y_{i} - v_{j}\|}{\|y_{i} - v_{l}\|} \right)^{\frac{2}{k-1}} \right)^{-1}$

(16)

$v_{j} = \frac{\sum_{i=1}^{m} (u_{ij})^{k} y_{i}}{\sum_{i=1}^{m} (u_{ij})^{k}}$

(17)

where $c$ is the number of clusters, $u_{ij}$ is the membership matrix, $v_{j}$ is the cluster center, $m$ is the number of sample points, and $k$ is the fuzzy index.

3. Output the congestion level historical time series matrix.
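The alternating updates of Equations (16) and (17) can be sketched as below. The defaults (`c=4` for the four congestion levels, fuzzy index `k=2`, the tolerance, and the random initialization) are illustrative assumptions; the source does not list the values used in the study.

```python
import numpy as np


def fcm(Y, c=4, k=2.0, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy C-Means on the rows of Y; returns memberships U and centers V."""
    rng = np.random.default_rng(seed)
    m = Y.shape[0]
    U = rng.random((m, c))
    U /= U.sum(axis=1, keepdims=True)        # each row of U sums to 1
    for _ in range(max_iter):
        Uk = U ** k
        V = (Uk.T @ Y) / Uk.sum(axis=0)[:, None]              # Eq. (17)
        dist = np.linalg.norm(Y[:, None, :] - V[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)          # guard against division by zero
        # ratio[i, j, l] = d_ij / d_il, raised to 2 / (k - 1)
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (k - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)                        # Eq. (16)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V
```

A hard congestion level for each time slot can then be read off as `U.argmax(axis=1) + 1`. Note that the cluster indices come out in arbitrary order, so mapping them to "smooth" through "congested" requires a separate ordering rule, for example by mean flow or delay per cluster.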

2.3. Congestion Identification Model Based on LightGBM

2.3.1. LightGBM

LightGBM (Light Gradient Boosting Machine) is an improved version of the Gradient Boosting Decision Tree (GBDT) algorithm, which is an ensemble learning method [21,22]. Public experiments have shown that LightGBM trains faster and consumes less memory than other gradient boosting tree algorithms. Even with high-dimensional features and a large number of samples, LightGBM can still achieve high accuracy. Moreover, this algorithm has achieved remarkable results in practical engineering applications. Therefore, this article applies the LightGBM algorithm to air traffic congestion identification. The core principles of LightGBM are as follows and are presented in Figure 3:

Gradient-based learning: LightGBM optimizes the loss function using a gradient-based approach. It calculates the gradients of the loss function with respect to the predictions and updates the model in a way that minimizes the loss, as shown in Equations (18) and (19):

$F_{m}(X) = \alpha_{0} f_{0}(X) + \alpha_{1} f_{1}(X) + \dots + \alpha_{m} f_{m}(X)$

(18)

$L[F_{m}(X), Y] < L[F_{m-1}(X), Y]$

(19)

where $m$ is the number of sub-models, $F_{m}(X)$ is the composite model, $f_{m}(X)$ is the sub-model, $\alpha_{m}$ is the weight of the sub-model $f_{m}(X)$, and $L[F_{m}(X), Y]$ is the loss function.
Leaf-wise tree growth: LightGBM grows trees leaf-wise instead of using the traditional level-wise approach. During each tree-growing iteration, it splits the leaf node that reduces the loss to the greatest degree, resulting in a more precise and compact model.
Gradient-based one-side sampling (GOSS): GOSS is a technique used in LightGBM to reduce memory consumption and increase training speed. It keeps the samples with large gradients and randomly samples from the remaining data with smaller gradients, thereby retaining important information while discarding less-informative data.
Exclusive feature bundling (EFB): EFB is another technique used in LightGBM to handle high-dimensional data efficiently. It bundles mutually exclusive features, that is, features that rarely take non-zero values at the same time, thus reducing the number of features and improving the training efficiency.
Histogram-based decision tree algorithm: LightGBM uses a histogram-based approach to build decision trees. It discretizes the numerical features into bins and uses histograms to store the statistics of feature values, which accelerates the tree construction process and reduces memory usage.
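As an illustration of the GOSS principle listed above, the following sketch selects a training subset from a vector of per-sample gradients. It is not LightGBM's internal code: the function name and the rates `a` (share of large-gradient samples kept) and `b` (share of the data drawn from the remainder) are illustrative assumptions. The sampled small-gradient rows are re-weighted by `(1 - a) / b` so that their total contribution stays approximately unbiased.

```python
import numpy as np


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) of the samples kept by one GOSS pass."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_n = int(a * n)
    rand_n = int(b * n)
    top_idx = order[:top_n]                   # always kept
    rest = order[top_n:]
    sampled = rng.choice(rest, size=rand_n, replace=False)
    weights = np.ones(top_n + rand_n)
    weights[top_n:] = (1.0 - a) / b           # compensate for undersampling
    return np.concatenate([top_idx, sampled]), weights
```

With `a=0.2` and `b=0.1`, only 30% of the rows are used to evaluate split gains, which is where the memory and speed savings come from.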

2.3.2. Receiver Operating Characteristic Curve and Area under the Curve

The Receiver Operating Characteristic (ROC) curve is a tool used to evaluate the performance of classification models. In practical engineering applications, an air traffic congestion dataset may suffer from a class imbalance, where the distribution of data samples across different congestion levels is uneven [23,24]. The ROC curve is robust and maintains its stability even in the presence of a class imbalance. Therefore, this article adopts the ROC curve as an evaluation metric.

The ROC curve is based on the relationship between the True Positive Rate (TPR), as shown in Equation (20), and the False Positive Rate (FPR), as shown in Equation (21). In the context of binary classification tasks, the ROC curve is constructed by varying the classification threshold and calculating the corresponding TPR and FPR. However, for multi-class tasks, the problem is typically transformed into a series of binary classification tasks; then, the ROC curve can be plotted.

$TPR = \frac{TP}{TP + FN}$

(20)

$FPR = \frac{FP}{FP + TN}$

(21)

The AUC (Area Under the Curve) refers to the area under the ROC curve, which is commonly used to evaluate the performance of classification models. The significance of the AUC is that it can measure the ability of a classification model to distinguish between positive and negative samples at different thresholds. The values of AUC range from 0 to 1, where a value closer to 1 indicates better model performance. The potential interpretations of AUC values are as follows:

AUC = 1: The model perfectly discriminates between positive and negative samples, meaning that for any comparison between a positive and a negative sample, the model can correctly predict their order.

AUC > 0.5: The model can distinguish between positive and negative samples to some extent. A higher AUC value indicates better model performance.

AUC = 0.5: The model’s predictive performance is equivalent to random guessing, indicating that the model cannot differentiate between positive and negative samples.

AUC < 0.5: The model’s predictive performance is worse than random guessing, suggesting the need to reconsider the model selected or make adjustments.

The process of constructing an ROC curve is as follows:

1. Input the test set into the trained LightGBM congestion identification model and calculate the probabilities of each sample being classified as being in either a smooth state, normal state, slightly congested state, or congested state.

2. Select the correct sample category and transform the multi-class task into a binary classification task. For example, consider traffic congestion level 1 (smooth state) to be the correct sample, and consider the other levels, namely, 2, 3, and 4 (normal state, slightly congested state, and congested state), to be incorrect samples. Define TP, FP, TN, and FN as follows:

True Positive (TP): The number of samples correctly classified as smooth state.
False Positive (FP): The number of samples incorrectly classified as smooth state from among other categories.
True Negative (TN): The number of samples correctly classified as other categories.
False Negative (FN): The number of samples incorrectly classified as other categories when the sample in question is actually in the smooth state.

3. Calculate TPR and FPR.

4. Plot the TPR and FPR of the classification model at different thresholds, forming the ROC curve and calculating the AUC specifically for the smooth state.

5. Repeat Steps 2 to 4 to plot the ROC curves and calculate the AUC for the normal state, the slightly congested state, and the congested state. Finally, use the macro-average method and micro-average method to plot the overall ROC curve and calculate the AUC.
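The one-vs-rest procedure above can be sketched in NumPy as follows. The function names are hypothetical, and class labels are assumed to be coded 0–3 rather than 1–4. The macro-average AUC is taken here as the plain mean of the per-class AUCs, which is a common simplification; averaging interpolated ROC curves, as plotting libraries often do, can give a slightly different value. The micro-average pools all (sample, class) pairs into a single binary problem.

```python
import numpy as np


def roc_points(y_true, score):
    """FPR and TPR at every distinct threshold, Eqs. (20)-(21)."""
    order = np.argsort(-score, kind="stable")
    y, s = y_true[order], score[order]
    tp = np.cumsum(y)                         # positives at or above threshold
    fp = np.cumsum(1 - y)                     # negatives at or above threshold
    # keep one point per block of tied scores (its last index)
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    tpr = np.r_[0.0, tp[last] / tp[-1]]
    fpr = np.r_[0.0, fp[last] / fp[-1]]
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def one_vs_rest_auc(labels, proba):
    """Per-class, macro-average, and micro-average AUC."""
    n_classes = proba.shape[1]
    onehot = (labels[:, None] == np.arange(n_classes)[None, :]).astype(float)
    per_class = [auc(*roc_points(onehot[:, c], proba[:, c]))
                 for c in range(n_classes)]
    macro = float(np.mean(per_class))
    micro = auc(*roc_points(onehot.ravel(), proba.ravel()))
    return per_class, macro, micro
```

Here `proba` is the matrix of predicted class probabilities from Step 1, with one row per test sample and one column per congestion level.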

3. Results and Discussion

3.1. Congestion Evaluation Indicators

Table 1 shows the high-dimensional congestion evaluation indicator matrix, denoted as $X$. Here, $Q$ is the terminal area flow feature vector; $\bar{V}$ is the average horizontal flight speed feature vector; $\bar{T}$ is the average flight time feature vector; $\bar{S}$ is the average vertical flight speed feature vector; $D$ is the average departure delay time feature vector; and $R$ is the terminal retention feature vector.

To avoid dimensionality confusion and redundant calculations and improve the accuracy of the algorithm, this study utilizes the Pearson correlation coefficient to analyze the correlations regarding the high-dimensional congestion evaluation indicator matrix, as shown in Figure 4. Weak correlations were observed between the terminal retention feature vector and the terminal area flow feature vector as well as between the average horizontal flight speed feature vector and the average flight time feature vector. The absolute values of the correlation coefficients between the remaining feature vectors are small, indicating that there is no significant correlation. Therefore, this study retains the selected six feature vectors.
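A correlation screening of this kind can be sketched with NumPy. The helper names and the 0.8 cutoff for flagging a pair as strongly correlated are illustrative assumptions; the source does not state the threshold used to decide whether an indicator is redundant.

```python
import numpy as np


def pearson_matrix(X):
    """Pearson correlation between the columns (indicators) of X."""
    return np.corrcoef(X, rowvar=False)


def strongly_correlated_pairs(X, names, threshold=0.8):
    """List indicator pairs whose |r| reaches the (assumed) threshold."""
    r = pearson_matrix(X)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs
```

An empty result from `strongly_correlated_pairs` would support keeping all six indicators, which is the conclusion reached here.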

3.2. Congestion Level Classification

Due to the cyclical nature of flight operations, this study utilized the TSNE-FCM algorithm to classify congestion levels in the aviation sector based on a weekly period. The time slots were divided as follows: the first week corresponds to time slots 1–672, the second week corresponds to time slots 673–1344, the third week corresponds to time slots 1345–2016, and the fourth week corresponds to time slots 2017–2688.
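The slot boundaries follow directly from the 15 min interval: one day contains 96 slots and one week contains 7 × 96 = 672 slots. A minimal sketch of the mapping between slot numbers and weeks, using hypothetical helper names and 1-based slot numbering as in the text:

```python
SLOTS_PER_DAY = 24 * 60 // 15          # 96 slots of 15 min
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY     # 672 slots


def week_of_slot(slot: int) -> int:
    """Week number (1-based) that contains a 1-based time slot."""
    return (slot - 1) // SLOTS_PER_WEEK + 1


def week_slot_range(week: int) -> tuple:
    """First and last slot (inclusive) of a 1-based week number."""
    start = (week - 1) * SLOTS_PER_WEEK + 1
    return start, start + SLOTS_PER_WEEK - 1
```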

Initially, the high-dimensional congestion evaluation indicator matrix is sliced based on weeks, and each slice is then subjected to dimensionality reduction using the TSNE algorithm. The data are reduced to two dimensions and normalized. The KL divergence measures the information lost during dimensionality reduction; it ranges from 0 to positive infinity, and the smaller its value, the more information is retained, as shown in Table 2.

As a result, we obtained four low-dimensional congestion evaluation indicator matrices corresponding to the four weeks, as shown in Table 3.

Then, we applied both the TSNE-FCM algorithm and the FCM algorithm to classify congestion levels and computed the respective silhouette coefficients for each of the four weeks. The silhouette coefficient indicates how effectively each algorithm clusters the congestion levels in a given week. As shown in Table 4, the silhouette coefficients obtained from the TSNE-FCM algorithm were consistently higher than those obtained from the FCM algorithm. Higher coefficients generally indicate better clustering results, implying that the TSNE-FCM algorithm identified more distinct congestion levels for each week.
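For reference, the silhouette coefficient of a sample is $(b - a) / \max(a, b)$, where $a$ is its mean distance to the other members of its own cluster and $b$ is its mean distance to the members of the nearest other cluster. A minimal NumPy sketch of the mean silhouette over hard labels is given below; the function name is hypothetical, Euclidean distance is assumed, at least two clusters are required, and singleton clusters are scored 0 by convention.

```python
import numpy as np


def mean_silhouette(Y, labels):
    """Mean silhouette coefficient of hard cluster labels (Euclidean)."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(Y))
    for i in range(len(Y)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                          # singleton cluster scores 0
        a = dist[i, same].sum() / (n_same - 1)          # own-cluster cohesion
        b = min(dist[i, labels == c].mean()
                for c in clusters if c != labels[i])    # nearest other cluster
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())
```

Values close to 1 indicate compact, well-separated clusters, whereas values near 0 or below indicate overlapping or misassigned samples.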

To evaluate the effectiveness of the algorithm in this study and integrate it with practical engineering, the number of delayed flights was selected as a comparative indicator. Stacked bar charts were used to visualize the congestion levels and the corresponding number of delayed flights for 1 July and 15 July, as shown in Figure 5.

Through comparative analysis, it was determined that the TSNE-FCM algorithm performs better. Terminal area congestion exhibits complex development, propagation, dissipation, and coupling mechanisms. The TSNE-FCM algorithm achieves a better dispersion of congestion levels compared to the FCM algorithm as shown in Figure 5a,c. Using the FCM algorithm to directly classify congestion levels would result in a stacked distribution of congestion levels as shown in Figure 5b,d. Considering the actual operational conditions, congestion levels fluctuate over time due to various external and internal factors and should not be accumulated. For example, on 1 July, the congestion levels classified by the TSNE-FCM algorithm irregularly fluctuated over time, whereas the FCM algorithm showed a clear stacking of congestion levels, with a continuous degree of congestion corresponding to level 4 (congested state) lasting 5.5 h (22 time slots), which clearly does not match the actual operation conditions.

A comparative analysis was conducted by comparing the congestion levels with the corresponding number of delayed flights in the same time slots. The congestion trend identified by the TSNE-FCM algorithm aligns well with the trend of the flight delays, as shown in Figure 5a,c: congestion level 1 (smooth state) corresponds to minimal flight delays, congestion level 2 (normal state) corresponds to a few flight delays, while congestion levels 3 (slightly congested state) and 4 (congested state) experience a higher concentration of flight delays. However, there is a significant discrepancy between the congestion trend identified by the FCM algorithm and the trend of the flight delays. In particular, congestion level 1 (smooth state) according to the FCM algorithm exhibits a considerable number of flight delays. Among the FCM algorithm’s results, out of the 621 instances classified as smooth state, 385 instances (61.99%) experienced flight delays. In contrast, among the TSNE-FCM algorithm’s results, out of the 432 instances classified as smooth state, only 56 instances (12.96%) experienced flight delays.

3.3. Congestion Identification

This paper uses the public mortgage default data set provided by Kaggle (https://www.kaggle.com/datasets, accessed on 5 July 2023) to evaluate the performance of the selected LightGBM model. On this data set, the LightGBM model achieves both the highest AUC and the shortest execution time among the four models, running roughly 10 times faster than the mainstream XGBoost model (0.3 s versus 3.2 s), as shown in Table 5. This suggests that the model can be applied to air traffic congestion identification to identify congestion levels quickly and accurately.

The congestion level time series values, as classified by the TSNE-FCM algorithm, were returned to their corresponding high-dimensional congestion evaluation indicator matrices to obtain the training set and the test set for the congestion identification model, as shown in Table 6.
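
As an illustration of this matching step, the sketch below joins indicator rows and congestion levels by time slot. The rows are copied from the first three rows of Table 1 and the levels from Table 6; the function name `build_labeled_set` and the dictionary layout are assumptions made for the example, not the authors' implementation.

```python
# Indicator rows keyed by time slot: (Q, V, T, S, D, R), first rows of Table 1.
indicators = {
    1: (4.00, 518.79, 360.00, -265.54, 900.00, -0.93),
    2: (10.00, 606.93, 326.67, -206.78, 0.00, -0.77),
    3: (17.00, 607.20, 267.69, -353.81, 660.00, -0.38),
}
# Congestion levels for the same slots, as listed in Table 6.
levels = {1: 2, 2: 1, 3: 2}


def build_labeled_set(indicators, levels):
    """Join indicator rows with cluster labels by time slot and return (X, y)."""
    unmatched = sorted(set(indicators) ^ set(levels))
    if unmatched:
        raise ValueError(f"time slots without a match: {unmatched}")
    slots = sorted(indicators)
    X = [list(indicators[s]) for s in slots]
    y = [levels[s] for s in slots]
    return X, y


X, y = build_labeled_set(indicators, levels)
print(len(X), "samples,", len(X[0]), "features, labels:", y)
```

Keying both structures by time slot makes a missing or extra label an explicit error rather than a silent misalignment.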

The dataset of the congestion identification model was divided in a 4:1 ratio, with 80% serving as the training set and 20% serving as the test set. The training set was then used to train the LightGBM, XGBoost, RandomForest, and ExtraTrees classifiers. Grid search and K-fold cross-validation were employed to optimize some of the parameters, while the remaining parameters were set to their default values (provided by the Scikit-Learn library). The optimal parameters obtained are presented in Table 7.
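
The 4:1 split and the K-fold partition can be sketched with the standard library alone. The text does not state the value of K, whether the split was shuffled, or which random seed was used, so `k=5`, the shuffling, and `seed=0` below are assumptions; the function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random rather than chronological split
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train, test = train_test_split_indices(2688)  # 2688 time slots, as in Table 6
print(len(train), "training slots,", len(test), "test slots")
for fold_train, fold_val in k_fold_indices(train, k=5):
    print(len(fold_train), "fit /", len(fold_val), "validate")
```

In a grid search, each candidate parameter combination would be fitted on every `fold_train` and scored on the matching `fold_val`, and the combination with the best average score would be kept; the test set stays untouched until the final evaluation.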

The trained models were evaluated using ROC curves, as shown in Figure 6. The area under the ROC curve (AUC) is indicated in Figure 6a,b. Figure 6c–f show the ROC curves of the LightGBM, XGBoost, RandomForest, and ExtraTrees classifiers, respectively; the closer a curve is to the upper left corner, the better the model performance. The LightGBM model outperforms the other three models in terms of the AUC for each congestion level (smooth state: 0.98, normal state: 0.96, slightly congested state: 0.96, and congested state: 0.92) and in terms of the macro-average (0.96) and micro-average (0.96) AUC.

The AUC values indicate the overall discriminatory power and performance of the models with respect to distinguishing between different congestion levels. The higher the AUC, the better the model’s ability to classify and predict the congestion states accurately. In this case, the LightGBM model demonstrates strong performance across all congestion levels, resulting in higher AUC values and indicating its effectiveness in the task of congestion evaluation in the aviation domain.
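
To make the per-level, macro-average, and micro-average AUC values concrete, the sketch below uses one common convention: each congestion level is scored one-vs-rest, the macro-average is the unweighted mean of the per-level AUCs, and the micro-average pools all (sample, level) pairs before computing a single AUC. Whether the figures above were produced with exactly this convention is an assumption; the function names and the toy scores are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multiclass_auc(y_true, y_score, classes):
    """Return (per-class, macro, micro) one-vs-rest AUC values."""
    per_class = {}
    pooled_labels, pooled_scores = [], []
    for col, cls in enumerate(classes):
        labels = [int(y == cls) for y in y_true]
        scores = [row[col] for row in y_score]
        per_class[cls] = binary_auc(labels, scores)
        pooled_labels += labels
        pooled_scores += scores
    macro = sum(per_class.values()) / len(classes)
    micro = binary_auc(pooled_labels, pooled_scores)
    return per_class, macro, micro


# Toy example (hypothetical): predicted probabilities for congestion levels 1..4.
y_true = [1, 2, 3, 4, 4, 3]
y_score = [
    [0.7, 0.2, 0.1, 0.0],
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.3, 0.4, 0.2],
    [0.0, 0.1, 0.3, 0.6],
    [0.1, 0.1, 0.5, 0.3],
    [0.2, 0.2, 0.3, 0.3],
]
per_class, macro, micro = multiclass_auc(y_true, y_score, classes=[1, 2, 3, 4])
print(per_class, macro, micro)
```

Under this convention, the plain mean of the four per-level values reported above, (0.98 + 0.96 + 0.96 + 0.92)/4 = 0.955, is in line with the reported macro-average of 0.96 once rounding is taken into account.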

4. Conclusions

This study focused on intelligent, reliable, timely, and accurate air traffic congestion identification methods. Firstly, it detailed the extraction and analysis of the multi-scale and multi-dimensional quantitative evaluation indicators of the congestion situation in the terminal area and the air–ground coupled flight flow. Secondly, this study combined the t-distributed stochastic neighbor embedding (TSNE) algorithm with the fuzzy c-means (FCM) algorithm to classify the congestion situation into different levels. Lastly, based on this foundation, a real-time, rapid congestion identification model based on LightGBM was constructed. The experiments provided the following findings.

(1) The TSNE-FCM algorithm improves the quality of congestion level classification, yielding silhouette coefficients superior to those of the FCM algorithm. The congestion level results obtained from this algorithm better reflect the complex development, evolution, propagation, and dissipation mechanisms of congestion phenomena in the terminal area, revealing that flight delays are mostly concentrated in the slightly congested and congested states.

(2) The congestion identification model based on LightGBM outperforms the XGBoost, RandomForest, and ExtraTrees models. The macro-average and micro-average AUC values of the LightGBM model were 0.96 and 0.96, respectively. The excellent performance of the LightGBM model makes it more suitable for practical engineering applications concerning the identification of congestion levels.

Author Contributions

Methodology, C.D., Q.Z. and C.N.; Validation, C.D., H.Z. and J.L.; Resources, Q.Z. and H.Z.; Data curation, C.D.; Writing – original draft, C.D.; Writing – review & editing, Q.Z.; Visualization, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number U2133207.

Data Availability Statement

Link to publicly archived dataset: https://www.kaggle.com/datasets/yasserh/loan-default-dataset, accessed on 5 July 2023. ADS-B data is not publicly available due to national confidentiality issues.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 2. Flow chart of TSNE-FCM algorithm.

Figure 3. Diagram depicting the principle behind the LightGBM algorithm.

Figure 4. The correlations in relation to the high-dimensional congestion evaluation indicator matrix.

Figure 5. A comparison using stacked bar charts of the number of delayed flights and congestion levels.

Figure 6. The ROC curves and AUC values.

Table 1. The high-dimensional congestion evaluation indicator matrix.

Time Slots	$Q$	$\bar{V}$	$\bar{T}$	$\bar{S}$	$D$	$R$
1	4.00	518.79	360.00	−265.54	900.00	−0.93
2	10.00	606.93	326.67	−206.78	0.00	−0.77
3	17.00	607.20	267.69	−353.81	660.00	−0.38
4	23.00	606.47	346.96	−593.65	0.00	−0.55
5	25.00	613.18	337.50	−710.99	1380.00	0.43
…	…	…	…	…	…	…
2688	43.00	536.04	370.24	−774.71	696.00	−1.00

Table 2. KL scatter (Kullback–Leibler divergence).

Timestamp	KL Scatter
The first week	0.2829
The second week	0.3029
The third week	0.3201
The fourth week	0.2807

Table 3. The low-dimensional congestion evaluation indicator matrices.

The First Week			…	The Fourth Week
Time Slots	Feature Vector A	Feature Vector B	…	Time Slots	Feature Vector G	Feature Vector H
1	1	0.511	…	2017	0.138	0.320
2	0.501	0.163	…	2018	0.191	0.479
3	0.995	0.495	…	2019	0.302	0.166
4	0.488	0.158	…	2020	0.228	0.246
5	0.818	0.916	…	2021	0.334	0.109
…	…	…	…	…	…	…
672	0.484	0.147	…	2688	0.113	0.548

Table 4. Comparison between TSNE-FCM and FCM.

Timestamp	TSNE-FCM Silhouette Coefficients	FCM Silhouette Coefficients
The first week	0.7561	0.3334
The second week	0.7474	0.1871
The third week	0.7379	0.3248
The fourth week	0.6783	0.2989
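
For readers who want to reproduce the kind of comparison in Table 4, the sketch below computes the mean silhouette coefficient for hard cluster labels, using the standard definition s = (b − a)/max(a, b), where a is the mean distance to the point's own cluster and b is the smallest mean distance to another cluster. It assumes Euclidean distance and that each time slot is assigned to its highest-membership FCM cluster; the function name and the toy points are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: its silhouette is taken as 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy 2-D points (hypothetical), mimicking a two-feature low-dimensional embedding.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # clusters that cut across the groups
```

Values close to 1 indicate compact, well-separated clusters, which is the sense in which the TSNE-FCM coefficients in Table 4 are better than the FCM ones.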

Table 5. The results of the performance comparison of the classification models.

	LightGBM	XGBoost	RandomForest	ExtraTrees
AUC	0.85	0.84	0.76	0.75
Execution time	0.3 s	3.2 s	1.5 s	0.8 s

Table 6. The training set and the test set.

Time Slots	$Q$	$\bar{V}$	$\bar{T}$	$\bar{S}$	$D$	$R$	Congestion Levels
1	4.00	518.79	360.00	−265.54	900.00	−0.93	2
2	10.00	606.93	326.67	−206.78	0.00	−0.77	1
3	17.00	607.20	267.69	−353.81	660.00	−0.38	2
4	23.00	606.47	346.96	−593.65	0.00	−0.55	1
5	25.00	613.18	337.50	−710.99	1380.00	0.43	4
…	…	…	…	…	…	…	…
2688	43.00	536.04	370.24	−774.71	696.00	−1.00	3

Table 7. The optimal parameters.

Models	Optimal Parameters
LightGBM	learning_rate = 0.01, max_depth = 3, n_estimators = 150, num_leaves = 15
XGBoost	learning_rate = 0.1, max_depth = 3, min_child_weight = 2, n_estimators = 50
RandomForest	max_depth = 12, min_samples_leaf = 20, min_samples_split = 120, n_estimators = 200
ExtraTrees	max_depth = 9, min_samples_leaf = 10, min_samples_split = 100, n_estimators = 50

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
