An Incremental Clustering Method for Anomaly Detection in Flight Data

Safety is a top priority for civil aviation. New anomaly detection methods, primarily clustering methods, have been developed to monitor pilot operations and detect risks from flight data. However, existing anomaly detection methods are offline learning methods: the models are trained once on historical data and then used for all future predictions. In practice, new flight data are accumulated continuously and analyzed every month at airlines. Clustering such dynamically growing data is challenging for an offline method because re-training the model every time new data come in is memory and time intensive. If the model is not re-trained, false alarms or missed detections may increase because the model cannot reflect changes in data patterns. To address this problem, we propose a novel incremental anomaly detection method based on the Gaussian Mixture Model (GMM) to identify common patterns and detect outliers in flight operations from digital flight data. It is a probabilistic clustering model of flight operations that incrementally updates its clusters based on new data rather than re-clustering all data from scratch. It trains an initial GMM on historical offline data, then continuously adapts to new incoming data points via an expectation-maximization (EM) algorithm. To track changes in flight operation patterns, only the model parameters need to be saved. The proposed method was tested on three sets of simulation data and two sets of real-world flight data. Compared with the traditional offline GMM method, the proposed method generates similar clustering results with significantly reduced processing time (57%-99% reduction in the testing sets) and memory usage (91%-95% reduction in the testing sets). Preliminary results indicate that the incremental learning scheme is effective in dealing with dynamically growing data in flight data analytics.


Introduction
Recently, flight data analytics has gained great attention in the aviation industry for safety management and efficiency improvement (Kang & Hansen, 2018; Qian et al., 2017; Sun et al., 2019). A number of machine learning methods have been developed to recognize patterns and/or detect anomalies in the massive amounts of operational data generated by computer systems onboard and on the ground, including digital flight data recorder (FDR) data and aircraft position tracking data from radar or Automatic Dependent Surveillance-Broadcast (ADS-B). These methods help to unravel patterns in flight operations and aircraft movement and to gain a better understanding of aircraft system conditions, pilot behaviors, and traffic flow dynamics. The results can be used to monitor system health, detect safety risks, and inform improvement strategies.
Among various digitized operational data, the digital flight data recorded by a Quick Access Recorder (QAR) or FDR contain detailed and comprehensive information about an airplane throughout a flight, as shown in Figure 1.

[Figure 2: Core concept of cluster analysis and anomaly detection in ClusterAD-Flight and ClusterAD-DataSample (Li et al., 2015, 2016)]

Analyzing such data with offline methods requires extremely large memory capacity and long computational time because data accumulated over months need to be repeatedly processed and analyzed. Incremental methods offer a better choice: they process data elements one (or a small batch) at a time and need much less memory space than offline methods, which store the whole dataset. The only work that we found to address the data size problem in flight data analysis is the Logarithmic multivariate Gaussian model developed by Li et al. (2020), which detects anomalies in flight data via a mini-batch training process. However, it cannot deal with changes in the number of clusters. Therefore, this work aims at further development in this direction: an incremental clustering method that can process the data and update its model (including the number and structure of clusters) online as new data come in.
Incremental methods receive data elements or batches one at a time and typically use much less space than what is needed to store the complete dataset. Incremental clustering aims to identify the inherent structure of the whole dataset while observing only a few data points at a time. Several papers have discussed the challenges in developing incremental clustering methods (Ackerman & Dasgupta, 2014). Meanwhile, many incremental or online clustering algorithms have been developed for stream data (Bao et al., 2018; Gupta & Grossman, 2004; Li et al., 2010; Lin et al., 2004). The most relevant ones are summarized here. One of the early attempts at data stream clustering was CluStream (Aggarwal et al., 2003). This method is best suited to spherical clusters, and it has been enhanced to deal with uncertainty in the data stream. HPStream (Aggarwal et al., 2004) is a modification of CluStream that can deal with high-dimensional data; it reduces the dimensionality of the data through a projection method that minimizes the radius of the clusters. Algorithms for stream data based on k-median and k-means have been developed by O'callaghan et al. (2002) and Beringer & Hüllermeier (2006). Although these algorithms reduce memory consumption and have low computational complexity, users need to provide the number of clusters, and the detected clusters tend to be spherical. DenStream is another data stream algorithm based on DBSCAN (Cao et al., 2006); it uses micro-clusters to summarize the overall shape of clusters without the need to keep all data in memory. Gao et al. (2005) introduced a grid-based method called DUCstream, applying the CLIQUE algorithm to find dense regions. The algorithm disregards regions whose density fades as new data come in, thereby adapting to changes in the data stream. IncDBSCAN, proposed by Ester et al. (1998), is another density-based incremental clustering method built on DBSCAN. However, the algorithm does not consider the relationships among successive updates, so its efficiency is low. Al-SL (Patra et al., 2013) is a distance-based incremental clustering algorithm that can find clusters of any shape. The method calculates the distance between a new point and the closest leader point of a cluster to determine whether the new data point belongs to that cluster. Its deficiencies are that searching the whole data space for the surrounding leader points is time-consuming and that the method is sensitive to noise. Another distance-based incremental clustering method was developed by Ibrahim et al. (2012), which can discover clusters of arbitrary shapes and densities in high-dimensional data. Ning et al. (2010) proposed an incremental spectral clustering method that efficiently updates the eigensystem. It can discover not only the stable clusters but also the evolution of individual clusters; however, it focuses only on dynamic graphs. Bandyopadhyay & Murty (2016) proposed a Frequent Pattern Tree (FP-tree) based incremental algorithm. Although this method balances clustering quality and computational complexity, it can only deal with discrete data and is invalid for continuous data. Pensa et al. (2014) developed a hierarchical co-clustering method in which a partition of objects and a partition of features are computed at the same time.
There are also several online methods based on GMM (Wu et al., 2005). The method developed by Hall et al. (2000) merges Gaussian components in a pair-wise manner by considering the volumes of the corresponding hyper-ellipsoids. Song & Wang (2005) proposed a more principled method that uses two statistics to compare the equivalence of GMM parameters: the W statistic for the covariance and Hotelling's T² statistic for the mean. A common drawback of these two methods is that they fail to use the original model when fitting new data. As a consequence, the new model can only explain the new data, which leads to a separation between the new model and the original one. Hicks et al. (2003) proposed a method to overcome this drawback: it allows a change in the number of components, does not assume independence of the components to be added, and is insensitive to the order in which the training data arrive. Vasconcelos & Lippman (1998) also proposed a similar approach for combining Gaussian components. Although these methods can update the models according to newly arrived data, they fail to consider outliers while updating the models, resulting in two problems: 1) clustering results would be biased by outliers; and 2) outlier flights could not be identified for safety management.
In response, this study aims to develop an incremental clustering method to identify common patterns and detect outliers in flight operations from digital flight data recorder data as new data come in. The results can help airlines identify safety risks, understand pilot behaviors, and track training effectiveness. Compared with existing methods, the new method can 1) detect outliers from flight data that accumulate periodically, 2) update the original model using information from both new and historical data, 3) identify new clusters if any, and 4) track changes in clusters over time.
The rest of this paper is organized as follows. Section 2 presents the proposed incremental clustering method for flight data analysis. In Section 3, three sets of simulation data are used to test the proposed method. In Section 4, the proposed incremental method is tested on two sets of flight data from real-world operations. In Section 5, we discuss the limitations of the proposed incremental clustering method. Finally, Section 6 summarizes our study and suggests future research directions.

Method
To achieve the aforementioned objectives, this paper presents an incremental clustering method for anomaly detection in dynamically growing datasets. Under the assumption that most flights show common data patterns under routine operations, the proposed method detects these common patterns using a GMM and updates the cluster parameters as new flight data come in. The statistical properties of each cluster, representing a common pattern in flight data, are described by Gaussian parameters and updated incrementally each time a new batch of data arrives.
Our proposed incremental clustering method contains two parts: offline and online. The offline part runs only once, on a large set of historical flight data, to obtain the initial parameters of the cluster model. The online part then runs whenever a new batch of flight data comes in, and the cluster model is updated accordingly. Both clusters and outliers are passed to airline safety experts or flight operations managers for review to identify safety risks, understand pilot behaviors, and track training effectiveness. The workflow of the proposed method is illustrated in Figure 3. The details are described in the following subsections, and the pseudocode is given in Algorithm 1 and Algorithm 2.

Pre-processing
A pre-processing step is needed to prepare the raw flight data for cluster analysis. After a certain part of a flight is selected for the analysis, the flight data are mapped into comparable vectors in a high-dimensional space, anchored by a specific event in time. Because flight parameters have different ranges and units, the values of each flight parameter are normalized to zero mean and unit variance for the offline data. Online data are normalized using the same statistics as the offline data. As a result, the selected part of a flight, considering the selected parameters, is represented by a vector x as shown in Eq. 1. More details about this pre-processing step are introduced in (Li et al., 2016).

x = (x_{11}, ..., x_{1n}, x_{21}, ..., x_{2n}, ..., x_{m1}, ..., x_{mn})   (1)

where x_{ij} is the value of the i-th flight parameter at time j; m is the number of flight parameters; and n is the total number of samples of each flight parameter.
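As a concrete illustration, the anchoring-and-normalization step can be sketched as follows. This is a minimal sketch in Python with NumPy; the function and variable names are our own, not from the paper:

```python
import numpy as np

def fit_normalizer(offline_vectors):
    """Learn per-dimension mean and standard deviation from offline flight vectors.

    Each row is one flight: m flight parameters sampled at n anchored time
    steps and flattened into a single (m * n)-dimensional vector (Eq. 1).
    """
    mu = offline_vectors.mean(axis=0)
    sigma = offline_vectors.std(axis=0)
    sigma = np.where(sigma == 0.0, 1.0, sigma)  # guard against constant parameters
    return mu, sigma

def apply_normalizer(vectors, mu, sigma):
    """Z-normalize vectors; online batches reuse the offline statistics."""
    return (vectors - mu) / sigma

# toy stand-in: 100 offline "flights", 3 parameters x 4 time steps = 12 dims
rng = np.random.default_rng(0)
offline = rng.normal(loc=5.0, scale=2.0, size=(100, 12))
online = rng.normal(loc=5.0, scale=2.0, size=(10, 12))

mu, sigma = fit_normalizer(offline)
offline_z = apply_normalizer(offline, mu, sigma)
online_z = apply_normalizer(online, mu, sigma)  # same standard as offline
```

Reusing the offline statistics for online batches keeps the two data streams on a common scale, which is what makes the incremental updates in the online part comparable to the initial model.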

Offline part: Initial Gaussian Mixture Model
In the offline part, an initial GMM is learned from a set of historical flight data by a robust GMM clustering method, and a set of outliers, O^0, is detected based on the learned GMM parameters. Figure 4 shows the initial model obtained from the offline part of the method. The learned initial GMM is described by Eq. 2:

p^0(x | λ^0) = Σ_{i=1}^{K_0} ω_i^0 N(x | μ_i^0, Σ_i^0)   (2)

where x is a random m-dimensional vector representing a random sample of flight data as described in Eq. 1; λ^0 = {ω_i^0, μ_i^0, Σ_i^0}, i = 1...K_0, are the GMM parameters estimated from the historical flight dataset D^0; K_0 is the number of components in the initial GMM; ω_i^0 is the mixture weight of Gaussian component i, satisfying Σ_{i=1}^{K_0} ω_i^0 = 1; and μ_i^0 and Σ_i^0 are the mean and the covariance matrix of Gaussian component i. O^0 is the set of outliers, i.e., data points that do not belong to any cluster under the learned GMM.
As many studies have pointed out, standard GMM clustering results are sensitive to the presence of outliers in the data: cluster centers and model parameter estimates can be severely biased by a few outliers. Therefore, we adopted a robust GMM-based clustering method, introducing outlier-aware probability density functions and solving the associated maximum likelihood estimation problem via EM-like algorithms, as proposed in several robust GMM clustering methods (Forero et al., 2012; Gao et al., 2014; Chang et al., 2005; Hodge & Austin, 2004; Hautamäki et al., 2005). The basic idea is to look for a set of GMM parameters and a corresponding set of outliers that minimize the regularized negative log-likelihood in Eq. 3:

min_{λ, O}  -Σ_{n=1}^{N} ln( Σ_{i=1}^{K_0} ω_i N(x_n - o_n | μ_i, Σ_i) ) + π Σ_{n=1}^{N} ||o_n||_2   (3)

where the outlier vector o_n is deterministically nonzero if x_n corresponds to an outlier, and 0 otherwise; O := {o_n : o_n ≠ 0, ∀n ∈ {1, 2, ..., N}} is the set of outliers; and π ≥ 0 is an outlier-controlling parameter that can be defined by the user. For π = 0, the cost in Eq. 3 becomes unbounded from below and all x_n are declared outliers. For π → ∞, the optimal o_n is zero, the dataset is deemed outlier-free, and Eq. 3 reduces to the conventional maximum likelihood estimation of a GMM.
To solve the minimization problem in Eq. 3, we adopted an expectation-maximization (EM) approach with block coordinate descent (BCD) iterations, as proposed by Forero et al. (2012). Given the number of mixture components K_0 and the value of the outlier-controlling parameter π, the algorithm updates the set of GMM parameters and the set of outliers iteratively until convergence. In each iteration, the cost in Eq. 3 is minimized with respect to one of the parameter sets ω, μ, Σ, O at a time, keeping the rest fixed at their most recently updated values. The final result of the iterations is the learned GMM parameters λ^0, also written as (ω^0, μ^0, Σ^0), and a corresponding set of outliers O^0. The robust clustering scheme for the offline part is presented in Algorithm 1.
In this robust clustering scheme, two parameters need to be specified as algorithm inputs: the number of mixture components K_0 and the outlier-controlling parameter π. K_0 is determined by sensitivity analysis: a range of K_0 values is tested, and the best K_0 is the one with the lowest Bayesian Information Criterion (BIC) (Schwarz et al., 1978). The value of π is determined by α, the percentage of the data that we want to detect as outliers. In the offline stage, we first set the number of outliers to be detected via α based on the user's preference. Then the robust clustering algorithm is run for a sequence of decreasing π values, π_g, until the expected number of outliers is identified (Forero et al., 2012). The expected number of outliers is α × N, where N is the total number of data points in the offline stage.
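The BIC-based choice of K_0 can be sketched with scikit-learn's GaussianMixture. This is a sketch under our own naming, not the paper's implementation; in the actual method, the robust outlier-aware estimator of Forero et al. (2012) would replace the plain GMM fit:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(data, k_candidates):
    """Fit one GMM per candidate K and return the K with the lowest BIC."""
    bics = {}
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(data)
        bics[k] = gmm.bic(data)
    return min(bics, key=bics.get), bics

# synthetic data with three well-separated Gaussian clusters
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(150, 2))
                  for c in (0.0, 4.0, 8.0)])
best_k, bics = select_k_by_bic(data, range(1, 7))
```

BIC penalizes each extra component by its parameter count, so once the true number of clusters is reached, adding more components raises the criterion rather than lowering it.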
A by-product of Algorithm 1 is the outlier detection criterion, which will be used in the online part to detect outliers when each batch of new data comes in. After the initial GMM is learned, the outlier detection criterion is defined as:

r = ln p^0(s_{αN} | λ^0, O^0)

where s is the sequence of data points x sorted by ln p^0(x | λ^0, O^0) in ascending order. Thus, r is the log-likelihood of the (αN)-th outlier under the GMM estimated in the offline part. This value is used as a threshold to flag new outliers O^new in the online part.
Therefore α, the percentage of the data to be detected as outliers, is a user-specified parameter that affects the clustering results in multiple ways. It determines the value of the outlier-controlling parameter π used in the offline part, as well as the outlier detection threshold r used in the online part. In practice, the value of α can be set based on the user's preference (e.g., an airline can manually review at most a certain number of outlier flights per month due to manpower constraints) while considering the distribution of the log-likelihood of all historical data under the initial GMM. For example, the distribution of the log-likelihood of the digital flight data recorder dataset in Section 4.2 has a long left tail (as shown in Figure 5), and α is set to 0.1% to best separate the outliers from the rest. Testing on both simulation data and real-world data, we found that any value between the 0.1th percentile and the 10th percentile can be chosen as the target outlier percentage α, depending on the dataset size.
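Computing the threshold r from α amounts to taking the αN-th smallest offline log-likelihood. A minimal sketch, with `outlier_threshold` being our own helper name:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def outlier_threshold(gmm, offline_data, alpha):
    """r = log-likelihood of the ceil(alpha * N)-th lowest-likelihood offline
    point; online points scoring at or below r are flagged as outliers."""
    ll = np.sort(gmm.score_samples(offline_data))          # ascending order
    idx = max(int(np.ceil(alpha * len(offline_data))) - 1, 0)
    return ll[idx]

rng = np.random.default_rng(2)
offline = rng.normal(size=(1000, 2))
gmm = GaussianMixture(n_components=1, random_state=0).fit(offline)
r = outlier_threshold(gmm, offline, alpha=0.01)
n_flagged = int((gmm.score_samples(offline) <= r).sum())   # about alpha * N points
```

By construction, roughly an α fraction of the offline data falls at or below r, which matches the "expected number of outliers" α × N used when tuning π.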

Online Part: Incremental Gaussian Mixture Model
After the offline part is completed, the algorithm's online part runs whenever new flight data come in. It identifies emerging clusters and updates existing clusters using new data. It is performed in four steps: 1) classify new data and identify outliers based on previous GMM; 2) identify emerging clusters; 3) combine emerging clusters with existing ones to form extended GMM; and 4) update GMM using new data. The pseudo-code of the online part is presented in Algorithm 2.

Classify new data and identify outliers based on previous GMM
When new flight data are collected and fed into the model, the algorithm first classifies the new data based on the existing clusters learned from the offline model or the previous update, as denoted in Eq. 6:

label(x) = argmax_{i ∈ {1,...,K_{T-1}}}  ω_i^{T-1} N(x | μ_i^{T-1}, Σ_i^{T-1})   (6)

where T records the number of rounds that the online part has been performed. The first time the online part runs (T = 1), the current mixture model is the initial model learned from the offline data without the outliers, p^0(x | λ^0); otherwise, it is the GMM updated in the previous round.
Outliers that do not belong to any existing cluster are detected when the log-likelihood of a data point is smaller than the threshold r defined in the offline part. After this step, the new outliers in the new data are identified and combined with the previous outliers O^{T-1} to form a set O^new, which is fed into the next step to identify any emerging clusters. Figure 6 illustrates this step: new data points are either classified into existing clusters or flagged as new outliers.
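Step 1 of the online part (maximum-posterior classification plus outlier flagging under the previous round's GMM) can be sketched as follows; the helper and variable names are ours:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_batch(gmm, batch, r):
    """Assign each new point to its maximum-posterior component, except
    points whose log-likelihood falls below the threshold r, which are
    flagged as outliers and labeled -1."""
    log_lik = gmm.score_samples(batch)
    is_outlier = log_lik < r
    labels = np.where(is_outlier, -1, gmm.predict(batch))
    return labels, is_outlier

# previous-round model: one component fitted on data around the origin
rng = np.random.default_rng(3)
history = rng.normal(size=(500, 2))
gmm = GaussianMixture(n_components=1, random_state=0).fit(history)
r = np.sort(gmm.score_samples(history))[4]        # 1st-percentile-style cut

batch = np.array([[0.2, -0.1],     # routine point -> existing cluster
                  [8.0, 8.0]])     # far-away point -> outlier
labels, is_outlier = classify_batch(gmm, batch, r)
```

Points labeled -1 here would be pooled with the previous round's outliers O^{T-1} before the emerging-cluster search in the next step.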

Identify emerging clusters
The objective of this step is to find any emerging clusters among all outliers from the new data and the offline data. DBSCAN (Ester et al., 1996) is used to initialize clusters, if any; GMM is then used to parameterize the identified emerging clusters. Figure 7 illustrates this step. DBSCAN is chosen because it can find dense regions among otherwise sparse data points. We set the clustering criteria (MinPts and ε) so that emerging clusters have a similar level of density to the existing clusters. The value of MinPts is set to 5 because the clustering result is insensitive to MinPts. The value of ε is set to the 90th percentile of the distance to the 5th nearest neighbor over all data points in existing clusters. The emerging clusters identified by DBSCAN are then parameterized by a GMM so that they can be combined with the existing clusters. To initialize the parameters of these emerging clusters, we use Eq. 7-9:
ω_i^emerging = N_i^emerging / N^emerging   (7)
μ_i^emerging = (1 / N_i^emerging) Σ_{x ∈ cluster i} x   (8)
Σ_i^emerging = (1 / N_i^emerging) Σ_{x ∈ cluster i} (x - μ_i^emerging)(x - μ_i^emerging)^T   (9)

where x represents the data points belonging to emerging cluster i identified by DBSCAN, N^emerging is the total number of data points in all emerging clusters, and N_i^emerging is the number of data points belonging to emerging cluster i. All these parameters are then further optimized using the standard EM algorithm on the data points in all emerging clusters identified by DBSCAN. Finally, to make the emerging components compatible with the existing ones, we adjust the weights of the emerging components using Eq. 10:

ω_i^emerging ← ω_i^emerging × N^emerging / (N^{T-1} + N^emerging)   (10)

where λ^emerging = {ω_i^emerging, μ_i^emerging, Σ_i^emerging}, i = 1...K^emerging, is the set of GMM parameters for the emerging clusters; K^emerging is the number of emerging Gaussian components; and N^{T-1} is the total number of data points in all existing clusters.
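The ε heuristic and the Eq. 7-9 initialization can be sketched as follows. The function names are our own, and `min_pts = 5` follows the text:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def find_emerging_clusters(outliers, existing_points, min_pts=5):
    """DBSCAN over accumulated outliers; eps is the 90th percentile of the
    distance to the 5th nearest neighbor within existing clusters, so an
    emerging cluster must be roughly as dense as the existing ones."""
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(existing_points)
    dists, _ = nn.kneighbors(existing_points)      # column 0 is the point itself
    eps = np.percentile(dists[:, min_pts], 90)     # distance to the 5th neighbor
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(outliers)

    n_emerging = int((labels != -1).sum())         # points in any emerging cluster
    params = []                                    # Eq. 7-9 initialization
    for lab in sorted(set(labels) - {-1}):         # -1 = still noise / outlier
        pts = outliers[labels == lab]
        params.append({"weight": len(pts) / n_emerging,
                       "mean": pts.mean(axis=0),
                       "cov": np.cov(pts.T)})
    return labels, params

rng = np.random.default_rng(4)
existing = rng.normal(scale=0.5, size=(200, 2))            # one dense existing cluster
blob = rng.normal(loc=10.0, scale=0.1, size=(30, 2))       # emerging cluster
noise = np.array([[20.0, 0.0], [0.0, 20.0], [-15.0, -15.0]])
labels, params = find_emerging_clusters(np.vstack([blob, noise]), existing)
```

Isolated outliers stay labeled -1 and are carried forward, while the dense blob becomes a parameterized component ready to be merged into the mixture.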

Combine emerging clusters with existing ones to form extended GMM
After the parameters of the emerging clusters are estimated, these new clusters are appended to the existing ones to form an extended GMM by row addition. The parameters of the extended GMM are represented by λ^extended, as shown in Eq. 11:

λ^extended = λ^{T-1} ∪ λ^emerging,  with K^extended = K^{T-1} + K^emerging   (11)
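The combination step is then a concatenation with weight rescaling. A sketch consistent with the weight adjustment described above (our own function names; the exact bookkeeping in the paper may differ):

```python
import numpy as np

def extend_gmm(weights, means, covs, emerging, n_existing, n_emerging):
    """Append emerging components to the existing GMM ("row addition").

    Existing weights are rescaled by N_{T-1} and emerging weights by
    N^emerging so that all K_extended weights again sum to one."""
    total = n_existing + n_emerging
    w = [wi * n_existing / total for wi in weights]
    m, c = list(means), list(covs)
    for comp in emerging:
        w.append(comp["weight"] * n_emerging / total)
        m.append(comp["mean"])
        c.append(comp["cov"])
    return np.array(w), np.array(m), np.array(c)

# two existing components (900 points) plus one emerging blob (100 points)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
emerging = [{"weight": 1.0, "mean": np.array([10.0, 10.0]), "cov": 0.01 * np.eye(2)}]
w_ext, m_ext, c_ext = extend_gmm(weights, means, covs, emerging, 900, 100)
```

Rescaling both sides by their point counts keeps the mixture a valid probability model, with each component's weight proportional to the data it explains.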

Update and consolidate GMM
In this step, the structure and parameters of the extended GMM are updated and optimized to achieve two objectives: 1) the centroid, shape, and weight of every component are adjusted in light of the new data; and 2) any similar components are merged to keep the GMM compact and avoid overfitting.
First, all new data in this batch, D^T, and the outliers from the previous batch, O^{T-1}, are re-classified based on the extended GMM. Anomaly detection is first performed for each data point using the log-likelihood criterion r, and the outlier set O^T is updated accordingly. Classification is then conducted for each data point that passes the anomaly detection test, according to the conditional probability. The number of data points belonging to component i, N_i^newdata, for i = 1...K^extended, is recorded.

Second, the algorithm updates the parameters of the extended GMM with the new information. Let λ_i^updated denote the set of updated GMM parameters for cluster i. Eq. 12-14 describe how they are updated:

ω_i^updated = (1 - w) ω_i^extended + w N_i^newdata / N^newdata   (12)
μ_i^updated = (1 - w) μ_i^extended + w μ_i^newdata   (13)
Σ_i^updated = (1 - w) Σ_i^extended + w Σ_i^newdata   (14)

where μ_i^newdata and Σ_i^newdata are the sample mean and covariance of the new data points assigned to component i, and w is a weighting parameter in [0, 1] that balances the impact of the new data versus the historical data on the GMM estimates. Here we set w as in Eq. 18:

w = N^newdata / (N^{T-1} + N^newdata)   (18)

Third, since components may grow, move, or shrink as the data grow, the algorithm checks whether any components have become similar; if so, they are merged to avoid overfitting with redundant components. Each pair of components is tested for equality of the covariance matrices and the means using the two statistics, W and Hotelling's T², proposed by Song & Wang (2005). If a pair of components is found to be statistically equivalent, they are merged following Eq. 19-23, as proposed in (Song & Wang, 2005).
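The per-component update and the merge of equivalent components can be sketched as follows. The convex-combination update mirrors the weighted blending described above, and the merge is the standard moment-matching form behind Song & Wang's formulas; both are our reading of the text, not the paper's verbatim equations:

```python
import numpy as np

def update_component(w_old, mu_old, cov_old, w_new, mu_new, cov_new, w):
    """Blend the historical and new-data estimates of one component;
    w in [0, 1] is the weight given to the new data."""
    return ((1 - w) * w_old + w * w_new,
            (1 - w) * mu_old + w * mu_new,
            (1 - w) * cov_old + w * cov_new)

def merge_components(w1, mu1, S1, w2, mu2, S2):
    """Moment-matching merge of two components judged equivalent by the
    W / Hotelling's T^2 tests: total weight, weighted mean, and a
    covariance that preserves the second moment of the pair."""
    w = w1 + w2
    a1, a2 = w1 / w, w2 / w
    mu = a1 * mu1 + a2 * mu2
    d1 = (mu1 - mu)[:, None]
    d2 = (mu2 - mu)[:, None]
    S = a1 * (S1 + d1 @ d1.T) + a2 * (S2 + d2 @ d2.T)
    return w, mu, S

w, mu, S = merge_components(0.5, np.array([0.0, 0.0]), np.eye(2),
                            0.5, np.array([2.0, 0.0]), np.eye(2))
```

Note that the merged covariance inflates along the axis separating the two means (the d dᵀ terms), so merging two offset components yields a single wider component rather than discarding the spread between them.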
The unique components remain unchanged. λ^unique, the collection of parameters of the unique components, is described by Eq. 24 and 25.
Lastly, all merged and unique components are consolidated into a new GMM, p^T(x | λ^T), and the number of data points in each Gaussian component, N_i^T, i = 1...K^T, is updated accordingly, as shown in Eq. 26-28. Figure 8 illustrates the step of updating and consolidating the GMM. The pseudocode of the incremental clustering method is presented below: Algorithm 1 for the offline part and Algorithm 2 for the online part.

Testing on simulation data
The performance of the proposed incremental clustering method was tested on three sets of simulation data: a low-dimensional set, a high-dimensional set, and a three-dimensional set without distinctive cluster boundaries. The true cluster membership labels of the simulation data were available to compare the performance of the proposed incremental clustering method and the traditional GMM method.
Simulation Data I

Data description

The first set of simulation data is a low-dimensional unbalanced dataset from the School of Computing, University of Eastern Finland (Fränti & Sieranoja, 2018). As shown in Figure 9, each data point is described by (x, y) in a two-dimensional space. There are eight clusters in two well-separated groups: three clusters on the left side with 2000 data points each, and five clusters on the right side with 100 points each.

Testing procedure
To test whether the algorithm can detect new clusters and update existing clusters according to the new data, the dataset was divided into six sets: one offline set and five online sets. The offline set contained the majority of the dataset except for one dense cluster (Cluster 0): it included 85% of the data randomly selected from Clusters 1-7 and 1% of the data randomly selected from Cluster 0. The rest of the data were equally assigned to the five online sets. The size of the offline set was 3845, and the size of each online set was 431.
The effectiveness of the proposed method was compared with the traditional GMM method. We applied the proposed method to the offline dataset and the five online datasets sequentially; meanwhile, we applied the traditional GMM method to the whole original dataset. The clustering results from the two methods were then compared. We used the W statistic and Hotelling's T² statistic (Song & Wang, 2005) to check the similarity of the covariance matrices and mean vectors between our proposed online method and the traditional GMM method, with α = 0.05 as the significance level.
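The mean-equality half of this similarity check can be sketched as a two-sample Hotelling's T² test in its standard pooled-covariance form; `hotelling_t2` is our own helper, not code from the paper:

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling's T^2 test of equal mean vectors; returns the
    T^2 statistic and a p-value via the F distribution."""
    n1, n2, p = len(X), len(Y), X.shape[1]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(X.T) + (n2 - 1) * np.cov(Y.T)) / (n1 + n2 - 2)
    t2 = n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)
    f_stat = t2 * (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)
    return t2, stats.f.sf(f_stat, p, n1 + n2 - p - 1)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
Y_same = rng.normal(size=(200, 3))                       # same mean as X
Y_shifted = rng.normal(size=(200, 3)) + np.array([1.0, 0.0, 0.0])

_, p_same = hotelling_t2(X, Y_same)
_, p_shift = hotelling_t2(X, Y_shifted)
```

A large p-value means the two clusters' means are statistically indistinguishable, which is the condition (together with the W test on covariances) under which matched components count as equivalent.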
The log-likelihood threshold for detecting outliers was set to -30 via testing; this was the 1st percentile of the log-likelihood of the offline data under the initial GMM, as shown in Figure 10.

Results
Using the proposed incremental clustering method, the clusters were updated as the data grew, as shown in Figure 11. When the first set of online data came in, a new cluster appeared (the blue cluster); as more online data came in, this new cluster grew and the parameters of the other existing clusters were updated accordingly. The log-likelihood of the updated GMM after processing each batch of data is shown in Figure 12. The testing results show that the proposed incremental GMM method and the traditional GMM method generated equivalent clustering results. Figure 13 compares the number of data points in each Gaussian component detected by the proposed incremental method with the numbers from the traditional GMM method. In the incremental method, the number of data points in each cluster gradually increased and finally reached the same level as in the traditional GMM method. All Gaussian components identified by the two methods passed the similarity test based on the W statistic and Hotelling's T² statistic, as shown in Table 1. The computational cost of the proposed incremental method on this dataset was smaller than that of the traditional GMM method in terms of both running time and memory usage, as shown in Table 2. When processing each batch of an online dataset, the running time of the proposed method was 43% of that required by the traditional GMM method, while the memory usage was as low as 9.1% of that of the traditional GMM method. Testing on the first set of simulation data thus shows that our algorithm is effective on low-dimensional data.

Simulation Data II
The second set of simulation data is the DIM-sets (high) data, also from the School of Computing, University of Eastern Finland (Fränti & Sieranoja, 2018). This dataset contains 1024 data points distributed in 16 well-separated clusters in a 32-dimensional space. Points within each cluster are randomly sampled from Gaussian distributions. We used this dataset to test the effectiveness of the model on high-dimensional data. The clusters are illustrated in Figure 14.

Testing procedure
The dataset was divided into six sets: one offline set and five online sets. The offline set contained the majority of the dataset except for one cluster (Cluster 0): it included 10% of the data randomly selected from this cluster and 85% of the data from the other 15 clusters. The rest of the data were equally assigned to the five online sets. The size of the offline set was 824, and the size of each online set was 40. The same testing procedure as in the Simulation Data I test was used to evaluate the effectiveness of the proposed method by comparing it with the traditional GMM method. The log-likelihood threshold for detecting outliers was set to -130, which was the 1st percentile of the log-likelihood of the offline data under the initial GMM.

Results
Testing results showed that the proposed incremental clustering method updated existing clusters and detected the emerging cluster as the data grew. Comparing the results of the incremental clustering method with those of the traditional GMM method, we found that most data were classified into the right clusters, but a few online data points were misclassified. Figure 15 compares the number of data points in each cluster identified by the proposed method with the numbers from the traditional GMM method. All clusters identified by the two methods passed the equality test; Table 3 shows the detailed statistical test results. In addition, Table 4 shows the reduction in computational cost using the proposed incremental method. The reduction was more significant on this dataset than on Simulation Data I due to the increase in dimensionality. When processing each batch of an online dataset, the running time of the proposed method was 6.0% of that required by the traditional GMM method, while the memory usage was as low as 9.2% of that of the traditional GMM method.

Simulation Data III

Dataset
To further test the performance of our proposed method, we generated a new three-dimensional dataset containing five clusters with 600, 500, 400, 300, and 200 data points, respectively. The clusters were not well separated, in order to test the performance of our algorithm on data without distinctive cluster boundaries. Figure 16 shows the five clusters of the dataset.

Testing procedure
This dataset was also divided into six sets: one offline set and five online sets. The offline set contained the majority of the dataset except for one cluster (Cluster 0): it included 1% of the data randomly selected from this cluster and 85% of the data from the other four clusters. The rest of the data were equally assigned to the five online sets. The size of the offline set was 1280, and the size of each online set was 144. We applied the proposed method to the offline dataset and the five online datasets sequentially and compared the results with those obtained from the traditional GMM method. Figure 17 shows the original clusters (Figure 17(a)-17(c)), the clusters detected by the traditional GMM method (Figure 17(d)-17(f)), and those detected by the proposed incremental method (Figure 17(g)-17(i)). In general, the clusters detected by the proposed method were similar to those detected by the traditional method and to the original clusters. However, the clustering result of the proposed incremental method did not exactly match the original cluster labels, nor did the result of the traditional GMM, especially for data points in the region where multiple clusters overlap. This is caused by the nature of GMM-based clustering methods: data points in the overlapping region cannot be separated without dimension augmentation or additional information.

Clusters identified by the proposed incremental method To further compare the clusters identified by our proposed method and the clusters by the traditional GMM method, we applied the W statistic and Hotelling's T 2 statistic. Results of the two statistics are shown in Table 5. All clusters passed the equality test, which means that the proposed method can detect similar clusters as those in the traditional GMM method.
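The centroid-equality check can be sketched with a two-sample Hotelling's T² test, which reduces to an F test. This is a minimal sketch of the mean-equality part only; the W statistic used for covariance equality is not reproduced here, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import f

def hotelling_t2_two_sample(X1, X2):
    """Two-sample Hotelling's T^2 test for equality of two cluster
    centroids. Returns (T^2, p-value)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two samples
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    # T^2 follows a scaled F distribution with (p, n1 + n2 - p - 1) dof
    f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    return t2, f.sf(f_stat, p, n1 + n2 - p - 1)
```

Two samples drawn from the same Gaussian yield a large p-value (equality not rejected), while samples with shifted means yield a p-value near zero.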

Testing on real-world flight data
This section presents the testing of the proposed method on flight data from real-world operations. Two sets of real-world data were used. One was aircraft trajectory data with two classification labels based on the Standard Terminal Arrival Route (STAR); the other was digital flight data recorded by the Quick Access Recorder (QAR data) without any cluster labels.

Flight trajectory data
Dataset
The aircraft trajectory dataset includes 2297 arrival flights to Hong Kong International Airport in November 2014 and from April to June 2015. The dataset is divided into two classes according to which STAR each flight follows. One class is ABBEY, which contains 815 flight trajectories; the other is SIERA, which contains 1482 flight trajectories, as shown in Figure 18.

Testing procedure
The dataset was divided into two parts: an offline dataset containing 80% of the data, randomly selected from the original dataset, and an online dataset containing the remaining 20%. The online dataset was equally divided into five subsets. The offline dataset contained 1186 data points, and each online subset contained around 59.

Results
In the offline part of the algorithm, we found that setting K to 7 resulted in the lowest BIC, as shown in Figure 19 (selection of K for the offline part). An initial GMM with seven components was learned from the offline data. As the five sets of online data came in, the model updated itself accordingly. No new cluster was detected in the online part. Seven final clusters were detected by the algorithm, as shown in Figure 20. We compared the clusters detected by the proposed method with the true class memberships (ABBEY and SIERA) and found that clusters 1, 4, and 7 were part of the ABBEY arrivals, while clusters 2, 3, 5, and 6 were part of the SIERA arrivals. The proposed method detected more clusters than there are true labels because these flight trajectories have another level of patterns within each STAR, depending on which runway the aircraft lands on.
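The BIC-based choice of K can be sketched with scikit-learn's GaussianMixture; toy 2-D data stands in for the trajectory features, and the candidate range of K is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data with three well-separated groups standing in for the
# real trajectory features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]])

# Fit a GMM for each candidate K and keep the K with the lowest BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print(best_k)  # K minimizing BIC (3 for this toy data)
```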

Digital flight data recorded by QAR
Finally, we tested the proposed method on a set of digital flight data recorded by QAR for flight operations and safety analysis. Since no ground truth (i.e. classification labels) was available (all flights were safe, with no incidents or accidents), we evaluated the proposed method by comparing it with the traditional GMM method and discussed the identified common patterns and outliers from the perspective of aircraft performance and pilot operations, based on the input of domain experts.

Dataset
The set of QAR data used for testing records the operations of an international airline's B777 fleet over 11 months (December 1, 2016 to October 30, 2017). The original data contain 10674 flights and 104 flight parameters. In this paper, the testing was performed only on the take-off phase for demonstration. Nine flight parameters were selected with the help of domain experts for the analysis of aircraft performance and pilot operations during the take-off phase. The selected key flight parameters are summarized in Table 6. Similar to the previous testing using simulation datasets, the original dataset was divided into one offline set and five online sets. To demonstrate the capability of the proposed method in capturing emerging clusters, we selected 99% of the flights of one cluster (denoted as Cluster 0) identified by the traditional offline GMM method, together with 10% of the remaining data selected randomly, as the online sets. The rest of the data were used as the offline set. Therefore, the online sets included 2918 flights in total, 2068 flights from Cluster 0 and 850 flights from other clusters or outliers, which were evenly distributed into five sets. The offline set included 7748 flights in total.

Testing procedure
Data preprocessing was first performed to re-sample and normalize the values of different flight parameters. Position-related parameters were converted into values relative to the take-off runway coordinate system. After the data transformation, each flight's take-off phase was represented by a vector with 810 dimensions (9 parameters × 90 seconds). To reduce the dimensionality, we performed principal component analysis (PCA) and kept the first components that retain 99% of the variance of the original data. After PCA, the number of dimensions was reduced to 98.
Since there is no ground truth regarding the cluster membership of each flight in the real-world data, the clustering result from a traditional GMM method was used as the benchmark. Regarding the selection of K (the number of mixture components), we found 3 to be the optimal value for this dataset as it gave the lowest BIC, as shown in Figure 21.

Offline clustering
Then the offline part of the proposed method was performed on the offline dataset to establish an initial GMM. We tested different values of K for the offline part and found 2 to be the best K for the offline data, as shown in Figure 22, which makes sense because we artificially extracted one cluster out as our online dataset. After the initial GMM was established, we calculated the log-likelihood of each offline data point and set a threshold of -500, which flagged 0.1% of the data points in the offline part of the algorithm as outliers. The same threshold was used in the online part to detect outliers: if the log-likelihood of a data point was smaller than the threshold, the data point was regarded as an outlier; otherwise, it was tagged as a normal data point.
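The thresholding logic can be sketched as follows. Synthetic data and a 2-component model stand in for the real offline set; the -500 value above was specific to the QAR data, so here the threshold is derived as the 0.1th percentile of the offline log-likelihoods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a GMM on "offline" data and take the 0.1th percentile of the
# per-point log-likelihoods as the outlier threshold.
rng = np.random.default_rng(0)
X_offline = rng.normal(size=(5000, 3))

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_offline)
loglik = gmm.score_samples(X_offline)
threshold = np.percentile(loglik, 0.1)   # flags ~0.1% of offline points

# Flag new points whose log-likelihood falls below the threshold.
X_new = np.vstack([rng.normal(size=(99, 3)),   # typical points
                   np.full((1, 3), 10.0)])     # one obvious outlier
is_outlier = gmm.score_samples(X_new) < threshold
print(int(is_outlier.sum()))
```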

Online clustering
The online part of the proposed method was run each time a set of online data was fed into the algorithm. The clusters were updated dynamically with each batch of online data. The number of points in each cluster identified in each round of incremental clustering is summarized in Figure 23. Compared with the traditional GMM method, the number of points in each cluster was similar. The similarity of the clusters identified by the proposed incremental method and those identified by the traditional method was checked using the W statistic and Hotelling's T² statistic. All three clusters identified by the two methods passed the equality tests regarding the cluster centroid and the covariance matrix. The results are summarized in Table 7.
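The core of the batch update can be sketched as a single EM-style pass that merges the batch's sufficient statistics with stored per-component counts. This is a simplified sketch, not the authors' implementation: the full online part also detects emerging clusters and consolidates redundant ones, which are omitted here, and `update_gmm` with its signature is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_gmm(weights, means, covs, counts, X_batch):
    """One incremental update of GMM parameters with a new data batch.

    counts[k] is the effective number of points previously assigned to
    component k; storing it lets old and new data be weighted correctly
    without keeping the old data in memory.
    """
    K = means.shape[0]
    # E-step: responsibilities of the batch under the current model
    dens = np.column_stack([
        w * multivariate_normal.pdf(X_batch, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: merge the batch's sufficient statistics with stored counts
    counts_upd = counts + resp.sum(axis=0)
    means_upd = ((counts[:, None] * means + resp.T @ X_batch)
                 / counts_upd[:, None])
    covs_upd = np.empty_like(covs)
    for k in range(K):
        # Old second moment recovered from stored covariance and mean
        old_second = counts[k] * (covs[k] + np.outer(means[k], means[k]))
        new_second = (resp[:, k, None] * X_batch).T @ X_batch
        covs_upd[k] = ((old_second + new_second) / counts_upd[k]
                       - np.outer(means_upd[k], means_upd[k]))
    weights_upd = counts_upd / counts_upd.sum()
    return weights_upd, means_upd, covs_upd, counts_upd

# One update on a toy two-component model: a batch of points at the
# first component's mean shifts weight toward it and leaves the
# distant second component essentially unchanged.
w, m, c, n = update_gmm(np.array([0.5, 0.5]),
                        np.array([[0.0, 0.0], [5.0, 5.0]]),
                        np.array([np.eye(2), np.eye(2)]),
                        np.array([100.0, 100.0]),
                        np.zeros((50, 2)))
print(np.round(w, 3))  # -> [0.6 0.4]
```

Only the parameters and counts need to be stored between batches, which is what keeps the memory footprint independent of the accumulated data size.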
The computational cost of using the proposed incremental method was significantly smaller than the cost of using the traditional GMM method on this set of flight data, as shown in Table 8. The running time of the proposed method was only 1.2% of the running time of the traditional GMM method when dealing with each batch of online data, while the memory usage was as low as 4.7% of the traditional GMM method. With the increase of data size and dimensionality, the benefit of reducing computational cost would be more significant.

Common patterns of flight data
The common patterns of flight data identified by the proposed method are presented and discussed in this section. Using the proposed incremental method, we expect to observe changes in the common patterns of flight data as new data come in, e.g. clusters drifting, emerging, or disappearing, when major changes are introduced in the flight procedures or pilot training methods. However, there were three clusters in this set of data, and no major changes happened during the period in which this dataset was collected. Therefore, we artificially excluded the data points of Cluster 0 from the offline data and gradually inserted those data points back into each online set, to test whether the proposed incremental method is able to identify this emerging cluster.

Figure 24 shows the two clusters identified from the offline dataset by the proposed method. The colored bands depict the value range of a flight parameter in a cluster; the blue bands represent Cluster 1, while the red ones represent Cluster 2. Flights in Cluster 1 carried less load, used less power for takeoff, climbed more slowly, and departed straight out, while flights in Cluster 2 were heavier, used higher power settings, accelerated faster, and made a turn after the initial climb.

Figure 25 shows the three clusters identified after processing all five sets of online data. As the new data came in via each online set, the proposed method was able to identify the emerging cluster. This emerging cluster, Cluster 0, is depicted in green in Figure 25. Flights in Cluster 0 share similar patterns in roll attitude with flights in Cluster 1: flights in both clusters were straight-out departures. The difference between Cluster 1 and Cluster 0 lies in Gross Weight and N1 (an engine power indicator): flights in Cluster 0 had larger gross weight values and used higher take-off power settings than flights in Cluster 1.
These clustering results can be used for safety management and efficiency improvement. The three clusters summarize the common patterns of pilot operations during takeoff for this aircraft type at this airline. Airline safety experts and pilot training managers can check whether these patterns meet operational standards. In this example, some potential issues were identified by the airline flight operations expert in Cluster 1 and Cluster 0. Both clusters showed a tendency of double rotation, exhibited as two peaks in Pitch Attitude and Vertical Speed, while Cluster 2 showed no such pattern, as shown in Figure 24 and Figure 25. An in-depth analysis needs to be carried out to understand why, and whether any flight operation or training procedure needs to be modified. As demonstrated in this case study, the proposed incremental method was able to capture cluster changes over time, e.g. drift, emergence, or disappearance. Further analysis based on the clustering results can be carried out to identify the root causes of such changes, e.g. pilot training, aircraft performance, airport conditions, and arrival/departure procedures. If an airline would like to measure the effectiveness of a particular training program, examining the cluster changes of flight data before and after the training using the proposed method can provide a quantitative assessment.

Outlier flights
Setting the outlier detection rate to 0.1%, we detected 14 outliers using the proposed incremental method and 12 outliers using the traditional GMM method; 12 outliers were detected by both methods. Table 9 summarizes the abnormal behaviors of these outliers observed from the flight parameters. The detailed flight parameter profiles of example outliers (highlighted in red in Table 9) are shown in Figure 26 and Figure 27.

Table 9: Summary of outliers detected by the proposed incremental method and the traditional GMM method (* example flights shown in Figure 26 and Figure 27)

Five outlier flights shared similar features of high-energy takeoffs. Figure 26 shows one of the high-energy takeoffs, Flight 4618. The gross weight of this aircraft was relatively light, but the take-off power was set close to the upper bound of any cluster, resulting in fast acceleration and climb, followed by a significant reduction in power, flap setting, pitch attitude, and vertical speed around 75 seconds after take-off. This type of outlier shows that the proposed method is able to detect atypical flight profiles, which need to be reviewed by safety experts to evaluate potential risks, if any.
Another type of detected outlier may be caused by sensor or data recording issues. For example, abnormal values in the Angle of Attack were observed in Flights 8298, 9084, and 10045. As shown in Figure 27, the Angle of Attack had large negative values at the beginning of the takeoff phase. This may be caused by sensor malfunctions, data recording issues, or other hardware/software issues not related to pilot operations. If airlines would like to focus on monitoring pilot operations, a pre-processing step can be developed to filter out this kind of anomaly. However, if airlines would also like to check for hardware or software issues related to flight data collection and recording, the proposed method can be used to detect this type of anomaly as well.

Discussion on limitations
The testing results show that the proposed incremental method can generate statistically equivalent clustering results as the traditional GMM method on simulated data as well as real-world digital flight data, with significantly reduced memory requirement and processing time. However, there are limitations to the proposed method.
First, our proposed method cannot identify nonlinearly separable clusters because Euclidean distance is used as the similarity measure. Kernel methods could be applied to extend the capability of the proposed method in this respect.
Second, no theoretical convergence analysis or optimality quantification is provided in this paper. On convergence, each step of the proposed method, i.e. the offline robust GMM, DBSCAN, and the EM algorithm for the online GMM update, is guaranteed to converge under certain conditions, as these are standard methods whose convergence has been proved in past literature. Yet, the convergence of the overall incremental method is challenging to prove. There are several papers on the convergence analysis of online EM algorithms. In those papers, the number of clusters is fixed, and the cluster parameter updating stays within the EM framework. Unlike those papers, in our proposed method the number of clusters is not fixed, and both EM and non-parametric algorithms (i.e., DBSCAN) are applied, so the statistical convergence relies on model specifications. On optimality, the proposed method may generate a locally optimal result on two levels: 1) the modified EM iterations in the offline robust GMM and the online GMM update may be trapped in local minima; 2) existing clusters cannot be split into several clusters in the subsequent online update part. In addition, clusters with irregular shapes, such as long chains, may be detected by DBSCAN, and problems arise when a GMM is used to describe such clusters. The density, location, and spread of each recognized cluster can provide some information on whether the clusters are well separated and dense, yet further work is needed to use these measures to calculate cluster validity indexes without performing calculations on the entire data (including offline and online data batches).
Third, we provide a set of sensitivity analyses on the internal validity of clustering results under different online-to-offline data ratios to inform future studies. Intuitively, when the accumulated size of the online datasets outweighs the size of the offline dataset, the incremental method is more likely to end up in a local optimum. Therefore, the robustness of the proposed method was tested by changing the relative sizes of the offline dataset and the online datasets. The three simulation datasets and the two sets of real-world flight data used in Sections 3 and 4 were randomly segmented into one offline set and 15 online sets. Each online dataset contained 10% as many data points as the offline data, so as online data came in, the total size of the online data grew to 1.5 times the size of the offline data. We compared the clustering results of the proposed incremental method after processing each set of online data with the clustering results of re-training a traditional GMM method, based on the W statistic and Hotelling's T² statistic.
Figures 28-32 show the results of this set of sensitivity analyses. The difference between the clustering results of the two methods becomes larger as the ratio of online data to offline data increases. The divergence increases significantly when the total size of the online data exceeds the size of the offline data, as indicated by the slope changes of the curves. Therefore, we conclude that full re-training, i.e. running the offline part of the algorithm on all available data, becomes necessary when the ratio between the total size of online data and the size of offline data approaches 1.

Conclusion
A new incremental clustering method for anomaly detection in flight data is presented in this paper. The method can identify emerging clusters, update existing clusters, and consolidate any redundant ones with dynamically growing data processed in batches. The proposed method was tested on both labeled simulation data and unlabeled real-world data. The results show that the proposed method can generate clustering results similar to those of the traditional one-off GMM clustering method.
Further work will be carried out on the testing and implementation of the proposed method. The proposed method needs to be further validated through expert reviews, case studies, and cross-checking with existing tools. A set of tools needs to be developed for data flow management, feature engineering, parameter settings, and results interpretation so that the proposed method can be implemented at airlines for safety management and pilot training.
Another direction of further work is to modify the current method to provide theoretical convergence guarantees and theorems on the incremental estimation of mixture models, for example proving that the offline model for the entire data at any time can be obtained by incrementally updating an online model with newly arrived data.