Anomaly detection on flight routes using a similarity and grouping approach based on automatic dependent surveillance-broadcast



Introduction
Li and Hansman [1] investigated anomaly detection with a cluster method that classifies flights under a general pattern, finding that flights outside the standard pattern were anomalous. A further development is abnormal flight detection based on cluster analysis without knowing the norm standard [2]. The difference from the previous study is that the aircraft type and airline are not specified, and flight problems need not be known in advance. These detection processes require experts to validate the results of the anomaly detection analysis.
Anomaly detection requires computational analysis. Training data from a specific month-long period serve as a modeling reference, and models are formed from custom analysis of flight patterns on an airline route. K-Means cluster analysis, as applied to earthquake epicenter clustering, categorizes data based on their properties and is well suited here because the resulting model requires neither ground truth as a reference nor justification from an expert. In this study, the number of clusters k is set to two: a large group of standardized data and a small group of anomalous data [3]. The analysis covers data that share the same call sign, route, and airline, and the detector recognizes specific segmented locations as anomalies. ADS-B data provide the aircraft position (latitude, longitude), height (altitude), and related aviation guidance information recommended by the International Civil Aviation Organization (ICAO) [4]. In 2017, an ADS-B study examined aircraft monitoring from outer space; the result is worldwide air traffic control that compares the accuracy of ground-based ADS-B information with space-based information [5]. Furthermore, ADS-B has been used in aviation anomaly detection studies to generate a traffic-warning algorithm [6] and to determine areas for conflict detection on aviation traffic routes [7].
Several recent studies continue to develop classification methods using support vector machines (SVM). This method is most appropriate for binary classification problems with two classes whose features are linearly separable on an interval [8]. Here, forming binary classes is a necessity for multiclass classification. The binary grouping technique in SVM begins by looking for an optimized hyperplane that distinguishes positive and negative samples [8][9]. The SVM classification method also has a robust training process, and implementing a regression method is essential for better SVM performance [10]. Another classification method is k-nearest neighbors (k-NN), which depends on a set of previously determined labels and classifiers. From the training data closest to the classifier, several distinct groups are formed [11]. In the latest study, k-NN classification can be applied to discrete and real-valued data and produces better accuracy values. However, this lazy algorithm requires precise accuracy, notably in determining the closest (regional) distance [12] using the Euclidean distance [13].
Another grouping method, K-means clustering, is carried out as a comparison. The developed model is better than the baseline if the comparison between the classification model (supervised) and the clustering model (unsupervised) shows minimal loss of precision. For optimal results, the cluster method used differs from the classic cluster algorithm: it omits random centroid initialization [14] to reduce the iteration process. The K-means cluster initialization uses maximum and minimum values to obtain the optimal cluster results [15]. Another model used is the similarity model, which identifies data with similar characteristics. In the latest research, the similarity method operates internally and externally [16]: the internal model is formed from pre-processed training data, while the external model is a reference data set. The similarity method is log-likelihood, which compares log-likelihood values to obtain a measure of similarity (coefficient) [17]. Grouping and classification can be performed based on this coefficient. The next step is similarity model testing, which divides the log-likelihood value of the test results by the log-likelihood value of the model. This method is called the log-likelihood ratio, or exact log-likelihood ratio, method [18].
This study determines anomaly detection areas on a flight route, based on the formed segments, without prior knowledge of flight anomaly criteria. It also proposes a new testing method based on maximum and minimum similarity models derived from a 30-day trained data reference.

Method
Fig. 1 shows the framework of this study. The research consists of several stages: data collection, preprocessing, segmentation, training, and testing. The following sections describe these stages in more detail.

Dataset
The data source used in this study is ADS-B, collected from an RTL 1090 radio device [19]. The data period is one month, with 100 records stored in a MySQL DBMS and used in the computational process. The parameters are date, latitude, longitude, and altitude. Table 1 shows a sample of ADS-B data for callsign LNI860 / SUB-PLW. The DATE attribute contains date, month, and year information; for example, DATE: 180414 denotes flight data on April 18, 2014. The LATITUDE and LONGITUDE attributes indicate the aircraft position as Cartesian coordinates (latitude, longitude); for this route, the latitude value is generally negative (-), while the longitude value is positive (+). The ALTITUDE attribute gives the height of the plane measured from the ground, in feet (convertible to nautical miles); for example, ALTITUDE = 35000 means 35000 feet = 5.76 nautical miles.
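As a minimal sketch of the attribute handling described above, the following assumes the DATE field is a DDMMYY string and uses the standard foot-to-nautical-mile conversion (the function names are illustrative, not part of the original system):

```python
from datetime import datetime

def parse_adsb_date(date_field: str) -> datetime:
    """Parse the DDMMYY DATE attribute, e.g. '180414' -> April 18, 2014."""
    return datetime.strptime(date_field, "%d%m%y")

FEET_PER_NAUTICAL_MILE = 6076.12  # standard conversion factor

def feet_to_nautical_miles(altitude_ft: float) -> float:
    """Convert the ALTITUDE attribute (feet) to nautical miles."""
    return altitude_ft / FEET_PER_NAUTICAL_MILE

date = parse_adsb_date("180414")
print(date.year, date.month, date.day)          # 2014 4 18
print(round(feet_to_nautical_miles(35000), 2))  # 5.76
```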
To obtain the data set in tabular format, a feature extraction process is first applied to the obtained ADS-B data. The data come from the information broadcast received by an RTL-SDR ADS-B (Realtek-based software-defined radio) system, which receives information from the aircraft's ADS-B signal in the data format shown in Fig. 2.

Preprocessing
Pre-processing generates the training data, derived from the mean of each record (rec-1 to rec-n, n = 100) of the ADS-B data over 30 days. Fig. 3 shows the stages of preprocessing that form the training data.
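A rough sketch of this averaging step, assuming the daily data are lists of (latitude, longitude) tuples with the same record count per day (the data layout is hypothetical; only the per-record mean is from the text above):

```python
def build_training_data(daily_records):
    """Average each record position over all days.

    daily_records: list of days, each a list of n (lat, lon) tuples.
    Returns a single list of n averaged (lat, lon) training records.
    """
    num_days = len(daily_records)
    num_recs = len(daily_records[0])
    training = []
    for r in range(num_recs):
        mean_lat = sum(day[r][0] for day in daily_records) / num_days
        mean_lon = sum(day[r][1] for day in daily_records) / num_days
        training.append((mean_lat, mean_lon))
    return training

# two toy "days" with two records each
days = [[(-7.0, 112.0), (-6.0, 113.0)],
        [(-7.5, 112.5), (-6.5, 113.5)]]
print(build_training_data(days))  # [(-7.25, 112.25), (-6.25, 113.25)]
```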

Segment
Segmentation groups the data based on the nature or habit of the data records. Based on the experiment, the number of segments is five, and each segment contains latitude and longitude parameters with 20 records. Fig. 4 shows the training data segmentation process, and Table 2 illustrates the five segments of the dataset with the records in each segment.
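The split of 100 training records into five consecutive 20-record segments can be sketched as follows (a simple equal split; the original segmentation may use other criteria):

```python
def segment(records, n_segments=5):
    """Split the training records into equal consecutive segments."""
    size = len(records) // n_segments  # 100 records -> 20 per segment
    return [records[i * size:(i + 1) * size] for i in range(n_segments)]

segments = segment(list(range(100)))
print(len(segments), len(segments[0]))  # 5 20
```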

Training and Models
Training determines a model. A similarity test (log-likelihood) for each segment determines the maximum and minimum values. These values are used as a classifier and as initial centroids for the grouping process. Fig. 5 shows the modeling stages based on the similarity test.

Testing
This stage tests the generated model. The similarity test compares the obtained log-likelihood value with the model log-likelihood values, C1 and C2 respectively, giving two log-likelihood ratio (LLR) values: LLR_C1 and LLR_C2. If LLR_C1 is less than LLR_C2 (LLR_C1 < LLR_C2), the test data belong to C1; otherwise, they belong to C2. Fig. 6 shows the testing stages between models.
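A literal-reading sketch of this decision rule, assuming the per-segment daily log-likelihood values are already computed and positive (all values here are hypothetical; the actual model values come from the training stage):

```python
def build_model(daily_loglikelihoods):
    """Model per segment: C1 = max, C2 = min of the daily LL values."""
    return max(daily_loglikelihoods), min(daily_loglikelihoods)

def classify(ll_test, model):
    """Divide the test LL by each class LL (the LLR described above);
    the class with the smaller ratio is assigned to the test data."""
    c1, c2 = model
    llr_c1 = ll_test / c1
    llr_c2 = ll_test / c2
    return "C1" if llr_c1 < llr_c2 else "C2"

model = build_model([80.0, 95.0, 60.0])  # -> C1 = 95.0, C2 = 60.0
print(classify(90.0, model))             # C1
```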

Performance Testing
The next stage uses a confusion matrix for computational accuracy measurements. The resulting indicators are accuracy and precision, generated from several parameters: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Formulas (1) and (2) measure system performance based on the accuracy and precision values: Accuracy = (TP + TN) / (TP + TN + FP + FN) (1) and Precision = TP / (TP + FP) (2).
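The two standard confusion-matrix measures above can be sketched directly (the counts in the example are illustrative only):

```python
def accuracy(tp, fp, tn, fn):
    """Formula (1): overall fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Formula (2): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

print(accuracy(45, 5, 40, 10))  # 0.85
print(precision(45, 5))         # 0.9
```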
The purpose of data normalization is to obtain reference data for similarity modeling. The reference data are the average of the obtained training data in each record. Here, MSE and SSE are determined based on latitude and longitude. Grouping, on the other hand, divides the data using two approaches: classification and clustering. Classification uses two methods, Support Vector Machine (SVM) and k-Nearest Neighbors (k-NN), while clustering uses the K-Means method.

Support Vector Machine (SVM)
In some cases, the Support Vector Machine (SVM) [21] is suitable for handling classification with an imbalanced dataset: although the resulting sensitivity is less useful, the resulting accuracy is better. For example, Prahara et al. [22] proposed SVM for recognizing motor vehicle license plates based on edge detection. The process forms an area that signifies the shape of the vehicle plate; a combination of Histogram of Oriented Gradients (HOG) and SVM localizes the number plate from a specific region, with the candidate number plate extracted using the vertical edge density method, achieving reasonable accuracy. Latah and Toker [23] used SVM to classify Denial of Service (DoS) attacks, which prevent users from accessing a computer network system; the resulting accuracy for detecting flooding-type DoS attacks is 96.25%. SVM is applied to a data set Z = {(x1, y1), (x2, y2), ..., (xn, yn)}, where yi is the label of the group and i = 1, 2, ..., n. The following equations formulate the set of SVM hyperplanes [23].
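The hyperplane decision rule underlying a trained linear SVM can be sketched as sign(w . x + b); the weight vector w and bias b below are hypothetical values standing in for a trained model:

```python
def svm_decision(w, b, x):
    """Linear SVM decision rule: the sign of w . x + b gives the label."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# hypothetical trained hyperplane separating two groups in 2-D
w, b = (1.0, -1.0), 0.0
print(svm_decision(w, b, (3.0, 1.0)))  # 1
print(svm_decision(w, b, (1.0, 3.0)))  # -1
```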

k-Nearest Neighbors (k-NN)
k-Nearest Neighbors is a supervised classification method characterized by labeled classes [24]. One application of k-NN is classifying speech signals in a noisy environment. The result is two signal classes in the speech feature: the speech signal and the non-speech signal. Testing then determines which group each test datum belongs to.
In that study, the accuracy was 80%. Bhattacharya et al. [25] proposed modifications of the k-NN algorithm applied to fifteen numerical data sets from the UCI machine learning repository. Based on 5-fold and 10-fold cross-validation, the average accuracy is better than that of a typical cluster algorithm.
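A minimal k-NN sketch using the Euclidean distance [13] and majority voting (the training points and labels are toy values, not flight data):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote of its k nearest neighbors.

    train: list of ((features...), label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
print(knn_classify(train, (0.05, 0.05)))  # A
print(knn_classify(train, (5.0, 5.1)))    # B
```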

K-Means
K-Means clustering is used to define a set of data residing in one group such that the distance between the group members and the centroid is minimal [26]-[28]. The centroid in a cluster is calculated as the mean of its members, C_k,j = (1/a) * sum_{i=1..a} x_i,j, where C_k,j is the centroid of group k for variable j, x_i,j is the value of variable j for member i, and a is the number of members in group k.
The general method is described in the following steps [3]: 1) determine the data to be clustered, 2) apply K-means cluster analysis to the earthquake data, 3) compute the Krzanowski and Lai (KL) criterion for each candidate number of clusters, and 4) determine the optimum number of clusters K. In this study, we use the KL index to determine the optimum K: the cluster count with the largest KL index is taken as the optimal number of clusters. In the earthquake study [3], seven clusters were obtained, with the attributes mean latitude, mean longitude, mean magnitude, mean frequency, and SSE.
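The K-Means variant described earlier initializes the centroids with the maximum and minimum values instead of random points [15]. A sketch of that idea for k = 2, in one dimension for brevity (the input values are illustrative):

```python
def kmeans_two(points, iters=20):
    """K-Means with k=2, centroids initialized at the min and max points
    (rather than random initialization, as described above)."""
    centroids = [min(points), max(points)]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            # assign each point to its nearest centroid
            idx = min((abs(p - c), i) for i, c in enumerate(centroids))[1]
            groups[idx].append(p)
        # recompute centroids as group means; keep old one if group empty
        new = [sum(g) / len(g) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break  # converged
        centroids = new
    return centroids

print(kmeans_two([1.0, 1.5, 2.0, 9.0, 9.5, 10.0]))  # [1.5, 9.5]
```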

Log-likelihood and Log-likelihood Ratio
The similarity approach uses the log-likelihood method for model similarity and the log-likelihood ratio for testing. The likelihood principle states that if two experiments involving a model with parameter theta give the same likelihood value, then the inference about theta must be the same, as explained in the following equations [29].


Let X be an observation matrix. The negative log-likelihood (LL) [30] is LL(theta) = -log L(theta | X). For testing, the log-likelihood ratio test [31] is used, where the likelihood ratio is Lambda = L(theta_0 | X) / L(theta_1 | X) and the log-likelihood ratio is LLR = log Lambda = log L(theta_0 | X) - log L(theta_1 | X).
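These quantities can be sketched for a one-dimensional Gaussian model (the model parameters and data are illustrative, not the paper's flight data):

```python
import math

def gaussian_loglik(data, mu, sigma):
    """Log-likelihood of `data` under a normal model N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def log_likelihood_ratio(data, model0, model1):
    """LLR = log L(theta0|X) - log L(theta1|X); positive favors model0."""
    return gaussian_loglik(data, *model0) - gaussian_loglik(data, *model1)

data = [0.1, -0.2, 0.05, 0.15]
llr = log_likelihood_ratio(data, (0.0, 1.0), (5.0, 1.0))
print(llr > 0)  # True: the data are far more likely under mean 0
```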

Results and Discussion
The result of the pre-processing stage is the training data, generated from 30 days of ADS-B reference data accompanied by data normalization. Table 3 shows the ADS-B reference data for 30 days; the number of records is 100 per day. Table 4 shows the training data along with the normalization values, namely MSE (Mean Square Error) and SSE (Sum Square Error). The normalization process eliminates duplicate data from the existing ADS-B reference data. In this study, the parameters used in the training data were latitude and longitude.
Segmentation determines areas for the anomaly detection process in a real-time period. This study has five determined segments, with the data per segment drawn from the 100 training-data records. The similarity of each segment's data was then tested against the daily ADS-B reference data over 30 days.
The model is built on the similarity test results (log-likelihood). The performed steps are segmentation, calculation of the log-likelihood value, and grouping. Classification used two methods (SVM and k-NN), while clustering implemented the K-Means method.
Table 5 shows the anomaly detection model based on the similarity test (log-likelihood). The similarity test with the training data produced data containing the log-likelihood values of 30 records. The log-likelihood values were grouped based on the maximum log-likelihood (C1) and minimum log-likelihood (C2) values; the log-likelihood latitude and log-likelihood longitude values determine C1 and C2.
The distribution of the data in classes C1 and C2 is shown in the graphical representation. For grouping with k-NN and K-Means, the results are always paired in each segment; as a result, the percentage and the distribution of data in C1 and C2 were equal. Fig. 7 and Fig. 8 show the data distribution with SVM grouping, while Fig. 9, Fig. 10, and Fig. 11 show the data distribution using k-NN and K-Means. The testing process uses ADS-B test data that are not part of the 30-day ADS-B reference data used for training. The stages are similar to the modeling process: the test data are divided into segments (segments 1 to 5), and a similarity test (log-likelihood) of the test data against the training data is conducted. Based on the obtained log-likelihood value, we compared it with the log-likelihood values in the training data (C1 and C2). The result is a log-likelihood ratio (LLR) calculation; the class (C1 or C2) with the smaller LLR value is assigned to the test data.
Test results were determined based on the LLR calculations, along with the determination of sequential classes per segment. For example, Segment 1 obtained an LLR_C1 latitude and LLR_C1 longitude of (79.23. Accuracy and precision measurements are shown in Table 6 and Table 7: Table 6 shows the accuracy (%) of each grouping that occurs in a segment, while Table 7 shows the precision (%) of each grouping in a segment. Based on the F-measure, Table 8 shows the FPR and TPR per segment for SVM, and for k-NN and K-Means. The 100% figures in the ROC arise because the TPR in each grouping (SVM, k-NN, and K-Means) is 100%; the only exception is 96%, in segment 1 for the k-NN and K-Means groupings. The FPR percentages vary, with a maximum of 50% and a minimum of 0%.

Conclusion
The result is a log-likelihood (LL) model for aviation anomaly detection based on ADS-B. Subsequent studies will address the formation of segments and the testing process: the segments will be formed by distance-based clustering, and the testing process will form a new model from the test data (re-modeling) without calculating the log-likelihood ratio. Based on the clustering approach, the highest percentage of data in C1 is in the fourth and fifth segments: 96% for k-NN and K-Means, and 93% for SVM. The highest percentage in C2 is in the first and second segments, at 10% for SVM, k-NN, and K-Means.

Fig. 2 .
Fig. 2. Information from the ADS-B signal of the aircraft. Bold numbers indicate the main attributes extracted to the DBMS (MySQL) for computational analysis.

Fig. 5 .
Fig. 5. Model creation through the similarity test

Table 1 .
ADS-B data

Table 2 .
Five segments of the dataset

Table 3 .
ADS-B data source for 30 days

Table 4 .
Training data with normalization results

Table 6 .
Accuracy (%) in each group that occurred in the segment

Table 7 .
Precision (%) in each group that occurred in the segment

Table 8 .
FPR and TPR