Machine learning-based turbulence-risk prediction method for the safe operation of aircrafts

This study proposed a method for detecting turbulence, a primary factor that influences safe aircraft operation. Because the number of observed turbulence events is limited, an appropriate workflow is required for detecting turbulence events from a small number of samples. In addition, the opinions and experiences of pilots must be reflected at the initial stage to address the high risk of turbulence occurrence, which can result in airline operations being cancelled. Thus, this study proposed a method for predicting turbulence occurrence that uses, as teacher data, the turbulence occurrence dates provided by airlines and meteorological data sets obtained from open data available in Japan. However, because commonly used machine learning methods were unable to detect the turbulence occurrence dates, the proposed method employed principal component analysis coupled with the k-means method to generate risk clusters with a high likelihood of turbulence occurrence, which were then checked statistically. Subsequently, the risk clusters were utilized as supervisory data for turbulence occurrence, and a support vector machine was used to predict turbulence occurrence. Furthermore, the results obtained with the proposed method were checked statistically and verified in practice by a pilot to confirm the appropriateness of the predicted turbulence occurrence dates.

operational acceleration limit of the aircraft, the scope of maintenance work increases considerably, thereby significantly impacting aircraft operation schedules. Therefore, airlines must strive to avoid severe turbulence to the greatest extent possible. However, pilots' assessments of turbulence tend to vary, so reports that rely primarily on their opinions are inevitably inconsistent.
Consequently, this study proposed a method for predicting turbulence occurrence, with the aim of contributing to the safe and comfortable operation of aircraft. Figure 1 outlines this method, which involves the accumulation and aggregation of open data and quick access recorder (QAR) data [4,5]. Turbulence is then predicted using machine learning methods, and the results are fed back to airlines and pilots. Flights to and from Matsumoto Airport in Japan, on E-170 aircraft operated by Fuji Dream Airlines (FDA), have been observed to frequently experience turbulence during the winter season. In this study, Matsumoto Airport was considered as the model airport representing mountainous areas subject to turbulence. The proposed technique can also be adapted to other airports.
For conducting the study, meteorological data from Japan and turbulence information provided by FDA were used. Because turbulence is a relatively rare event, the risk cluster was estimated first. To this end, a principal component analysis (PCA) of the meteorological data was conducted to obtain a projection matrix W that reduces the number of dimensions of the data to be analyzed. Subsequently, using the turbulence-occurrence indicator and the meteorological data transformed by W, the k-means method was employed to calculate the risk cluster. The risk cluster was then used to predict, through support vector classification (SVC), the days with turbulence risk in the meteorological data from the year 2019. The results based on these meteorological data revealed that the prediction method accurately identified the days with a risk of turbulence.

Related work
Most existing research concerning turbulence prediction has been performed from a meteorological perspective [6,7], such as studies conducted to examine past turbulence incidents [8]. In an event that occurred in central Colorado on January 11, 1972, optimal conditions for strong mountain wave generation were detectable from sounding data 12-24 h in advance and approximately 1000 km upstream [9]. Further, in the case of a fatal accident involving a light aircraft near Clonvina Inn, Victoria, Australia, on July 31, 2007, the observed environment was analyzed, and the region where turbulence intensified was identified through a three-dimensional simulation [10]. When a Boeing 777 encountered severe clear-air turbulence (CAT) over western Greenland at an altitude of 10 km on May 25, 2010, digital flight data recorder (DFDR) analysis and high-resolution numerical simulations confirmed that a high-resolution non-hydrostatic simulation model could predict mountain-wave turbulence (MWT) [11]. Thus, understanding past examples is crucial to identifying and predicting the conditions for turbulence.
Currently, analysis using QAR data recorded on board the aircraft is also under consideration for predicting turbulence. Further, new methods for estimating the eddy dissipation rate (EDR), considered a measure of turbulence, from QAR data [12], a comparison of calculation algorithms [13], and the development of QAR data analysis software to calculate meteorological quantities such as three wind components, wind shear risk coefficient, and turbulence intensity parameters [14] have been proposed as well.
In the current aviation industry, one method for turbulence detection involves the use of Doppler lidar [15][16][17]. A laser beam (using a wavelength band that is safe for the pilot's eyes) can be fired into the atmosphere to observe winds in the sky. Although CAT cannot be detected by conventional aviation weather radars, airborne predictive windshear (PWS) radars enhanced with algorithms designed for turbulence detection and long-range airborne Doppler lidars have been developed and operated [18][19][20]. Turbulence detection using these systems has reduced the number of turbulence encounters by alerting pilots to the possibility of encounters.
However, in these studies, data were acquired in real time from many sensors and analyzed using a time-series approach [21]. Although turbulence forecasting with pinpoint accuracy is desirable, preparing a suitable environment for the sensors results in significant cost, and thus, it is infeasible for airlines.
In recent years, owing to the accumulation of aviation data and improvements in computation speed, the concept of turbulence prediction via machine learning has been introduced [22,23]. However, studies concerning this subject are limited. Furthermore, determining an optimal machine learning approach for turbulence prediction is challenging. Moreover, there exists a need to utilize open data (such as meteorological data) to improve analysis accuracy, as it can aid in the development of turbulence predictions that can be logically deduced from the data provided by the airlines. For example, in a detailed study of the causes of 700 fatal aviation accidents involving commercial airliners that occurred worldwide between 1990 and 2006, it was found that the composition of accident causes varied greatly depending on the region of the world, type of operation, and category of aircraft [24]. Further, a study proposed a turbulence prediction algorithm that was based on the examination of turbulent weather phenomena and aircraft operations using a stepwise multiple regression analysis model [25].
Thus, the above discussion reiterates the importance of developing a system that can predict turbulence, which is among the most common causes of aviation accidents, independent of the equipment and environment used to acquire the data.
This study attempted to approach machine learning from a non-meteorological perspective. PCA was employed to generate risk clusters for the data and determine the prediction accuracy. Several studies have followed a procedure similar to that of this study [26]. For instance, when attempting to identify the relevant genes for gene expression classification, the data was passed through PCA and independent component analysis methods, and based on the variants of the class obtained, the selected elements were individually transformed to lower dimensions. Consequently, the classification performance of the experiment was evaluated using a support vector machine kernel classifier [27]. Further, in Classifying Colon Cancer Microarray Data, PCA and Partial Least Square (PLS) have also been used to extract more features [13].
However, this has never been done in case of turbulence analysis. Therefore, this study examined the possibility of it being applied as a new method in the field for turbulence prediction.

Basic analysis of turbulence at Matsumoto Airport
In this section a basic analysis of the data collected at Matsumoto Airport is described.

Examples of the effects of turbulence on flights from Matsumoto Airport
Considering the topographical characteristics shown in Fig. 2, it can be inferred that flights operating from Matsumoto Airport are susceptible to mountain waves [28] from the Northern Alps, particularly on the route toward New Chitose Airport. Table 1 summarizes the turbulence, presumably caused by mountain waves, reported by flights departing from the Matsumoto Airport. The authors were present on the flight that departed on December 27th, 2017, to gain a real-world understanding of the level of turbulence faced during a flight. Table 2 shows the meteorological conditions during

Visualization of the wind direction, speed of mountain waves, and sway of aircraft
The first step toward solving the problem involves visualizing the turbulence and its resulting impact on operations. Thus, a visualization depicting a severe turbulence scenario was created, wherein the altitude changes during turbulence were modeled as per the flight of an E170 aircraft. Further, the aircraft altitude at every second was depicted using Google Earth Pro 7.3.4. Figure 3 shows the visual representation of a journey via FDA Flight 211 in January 2018, wherein the pilot encountered severe turbulence during   figure). Consequently, significant altitude changes were observed during this period.
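As an illustration of how a per-second altitude track could be prepared for a viewer such as Google Earth Pro, the following minimal sketch writes a track as a KML `LineString`. The text does not state which file format was used, so KML is an assumption here, and the function name `track_to_kml` and the sample coordinates are hypothetical.

```python
from typing import Iterable, Tuple
from xml.sax.saxutils import escape


def track_to_kml(name: str, points: Iterable[Tuple[float, float, float]]) -> str:
    """Build a KML document with one LineString from (lon, lat, alt_m) samples."""
    # KML coordinate tuples are ordered longitude,latitude,altitude (metres).
    coords = " ".join(f"{lon},{lat},{alt}" for lon, lat, alt in points)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        "    <LineString>\n"
        # "absolute" interprets the altitude relative to sea level.
        "      <altitudeMode>absolute</altitudeMode>\n"
        f"      <coordinates>{coords}</coordinates>\n"
        "    </LineString>\n"
        "  </Placemark>\n"
        "</kml>\n"
    )


# Hypothetical two-sample track; real input would be one sample per second.
kml = track_to_kml("example track", [(137.92, 36.17, 657.0), (137.95, 36.25, 3000.0)])
```

Writing the returned string to a `.kml` file would let the altitude changes be inspected as a three-dimensional path.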

Elementary analysis of turbulence occurrences using open data
To create the dataset, weather information from October 1, 2017, through March 31, 2018, was obtained from Sunny Spot [29], which is the website homepage of the Japan Meteorological Agency [30]. In addition, an environmental database provided by Iowa State University [31] was used as well. Subsequently, a dataset with 165 rows and 45 columns was created as the set of explanatory variables. Table 3 summarizes the items in this dataset. Using real-world QAR data from a pilot report provided by FDA, Yes/No values were obtained for each day of the observation period, indicating whether any FDA flight that either departed from or landed at Matsumoto Airport encountered turbulence of a greater than moderate ("moderate-plus") level or higher. Three instances of moderate-plus turbulence exist in the data used in this study. The data items are named according to the pattern "location-time-altitude-type." Figure 4 illustrates the boxplots of fx106-03-500-spd, Wajima-12-700-temp, Matsumoto-12-500-hum, and fx106-03-500-shear; here, all data are normalized. The circle, triangle, and square in each boxplot represent the instances of turbulence in the data. On the days when turbulence was observed, the wind speed and shear were high, while the temperature was low [33].
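The "location-time-altitude-type" naming of the data items can be split into its four fields programmatically. The following is a minimal sketch; the class name `WeatherItem` and the function `parse_item` are hypothetical, and the fields are kept as strings because their units are not stated in this section.

```python
from typing import NamedTuple


class WeatherItem(NamedTuple):
    location: str  # e.g. "fx106", "Wajima", "Matsumoto"
    time: str      # e.g. "03", "12" (kept as text; unit not stated here)
    altitude: str  # e.g. "500", "700" (kept as text; unit not stated here)
    kind: str      # e.g. "spd", "temp", "hum", "shear"


def parse_item(name: str) -> WeatherItem:
    """Split an item name of the form location-time-altitude-type."""
    parts = name.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 hyphen-separated fields, got {name!r}")
    return WeatherItem(*parts)


items = [parse_item(n) for n in (
    "fx106-03-500-spd",
    "Wajima-12-700-temp",
    "Matsumoto-12-500-hum",
    "fx106-03-500-shear",
)]
```

Grouping columns by `kind` in this way would make it straightforward to compare, for instance, all wind-speed items across locations.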

Turbulence-occurrence analysis using PCA
Owing to a lack of sufficient data for observing patterns in annual turbulence, predicting its occurrence through supervised learning is challenging [34]. In addition, weather conditions may have affected operations on days other than those on which turbulence was reported. Further, meteorological data comprise many explanatory variables, and determining the variables that contribute to turbulence is complex. Thus, in this study, risk clusters were formed using PCA and statistical information on their weather conditions. These clusters supplement the scarce information on the days of turbulence occurrence, represent the weather conditions affecting flight operations, and contribute to the decision-making of pilots and airlines when operating flights in high-risk environments. In addition, a method for determining forecast accuracy was applied as well. First, the dimensions of the explanatory variables were reduced via PCA, and the resulting weights were used to calculate the risk clusters employing the k-means method. Subsequently, the risk clusters obtained were used to predict the occurrence of turbulence through SVC. The program was executed in a Python 3.7.0 environment with scikit-learn version 0.23.2. The algorithm is described as follows.
(1) Creation of a dataset for turbulence predictions, using open data
(2) Calculation of turbulence risk cluster
  (a) A projection matrix $W$ is created via PCA [35].
    (i) Let the $i$-th data be $x_i$, and let Y be all the rows of the data matrix $X$. The covariance matrix $S$ of $X$ is computed.
  (b) The data are converted to the principal component (PC) vector $Z$ using $W$, such that $Z = Wx$.
  (c) The risk clusters are generated based on $Z$ using the k-means method [19].
    (i) If the set of indices of $x_i$ belonging to the $j$-th cluster is $I_j$, the center of gravity $G_j$ of the cluster is $G_j = \frac{1}{|I_j|} \sum_{i \in I_j} x_i$.
    (ii) For each $x_i$, calculate the distance from each center of gravity and assign $x_i$ to the cluster with the closest center; repeat.
(3) Prediction of turbulence occurrence
  (a) Prediction of turbulence-occurrence dates via SVC.
    (i) Consider the following optimization problem for a map $\varphi : \mathbb{R}^c \to \mathbb{R}^c$: maximize $f(a) = \sum_{k=1}^{n} a_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{l=1}^{n} a_k a_l y_k y_l K(z'_k, z'_l)$.
  (b) Validation of predicted turbulence-occurrence dates.
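Steps (2)(c)(i) and (2)(c)(ii), the center-of-gravity update and the nearest-center assignment, can be sketched in plain Python as follows. This is an illustrative implementation of the standard k-means iteration, not the code used in the study (which relied on scikit-learn); the function names and the toy points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Center of gravity G_j: the mean of the points assigned to one cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance (sufficient for finding the nearest center)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points: List[Point], centers: List[Point]) -> List[int]:
    """Assign each point to the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points: List[Point], init_centers: List[Point],
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    centers = [list(c) for c in init_centers]
    labels = assign(points, centers)
    for _ in range(max_iter):
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = centroid(members)
        new_labels = assign(points, centers)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Toy data: two well-separated groups in two dimensions.
labels, centers = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])
```

In practice the result depends on the initial centers, which is one reason library implementations run several initializations and keep the best.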

Dimensionality reduction and coordinate transformation in PCA
PCA was employed to determine the factors that cause turbulence. Figure 5 depicts a plot for each observation date, with PCs 1 and 2 forming the x- and y-axes, respectively; the points indicated by arrows represent the three actual instances of turbulence. Flights with turbulence are plotted in the upper-right part of the figure. Figure 6 illustrates a scatter plot of the elements comprising the first and second PC planes. As can be observed, the wind speed, contour line, and trough elements are concentrated in the upper-right quadrant, and the temperature elements in the upper-left. It can be inferred that the farther a point lies to the right along PC1, the higher the wind speed and the lower the temperature (i.e., there are many contour lines). Further, the humidity elements are present at the top of the plot, while the wind direction and cloud height elements are at the bottom, indicating that in case of high humidity, the wind direction is negative. Thus, when most of the wind is from the west, on occurrence of turbulence, the wind from the southwest exerts a significant  Table 4. Figure 5 reveals that in PC4, troughs and cloud height were major influencing factors on the days when the turbulence occurred. Further, it can be observed that wind speed difference significantly influences PC5. The cumulative contribution rate of PC1 and PC2 was 43.23%, and 13 components were required to reach a cumulative contribution of at least 80%. Therefore, 13 PCs were considered to obtain the matrix W that performs coordinate transformations based on Z = Wx. Here, x is the original data and Z is the coordinate after transformation.
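The rule used to fix the number of PCs (the smallest number whose cumulative contribution rate reaches 80%) can be written as a short helper. The sketch below is illustrative: the function name `n_components_for` is hypothetical, and the example ratios are made-up values, not the contribution rates obtained in the study.

```python
from itertools import accumulate
from typing import Sequence


def n_components_for(ratios: Sequence[float], threshold: float = 0.80) -> int:
    """Smallest k such that the first k contribution rates sum to >= threshold.

    `ratios` are per-component contribution rates sorted in descending order.
    """
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    raise ValueError("threshold is never reached by the given ratios")


# Made-up contribution rates for illustration only.
example_ratios = [0.50, 0.20, 0.15, 0.10, 0.05]
k = n_components_for(example_ratios)
```

With scikit-learn, the same selection is available by passing a fraction as `n_components` to `PCA` together with `svd_solver="full"`.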

Calculation of risk clusters by k-means method
Using the coordinate transformation matrix obtained from the PCA described in the previous section, the risk cluster was calculated employing the k-means method, wherein the Z coordinate transformed via W was used. Figure 7 shows the resulting classification into six clusters. Clusters where turbulence was expected to occur are indicated in red, and included almost all the dates on which turbulence was observed, as presented in Table 1. However, although Cluster ID 5 might have been affected by turbulence, it did not significantly affect flight operations. Moreover, it is also probable that the other clusters were less affected by turbulence. Figure 8 presents a comparison of the risk clusters with the other clusters. It is evident that the risk clusters exhibit higher wind speeds, lower temperatures, lower humidity, and larger wind speed differences. Further, a t-test or Welch's test conducted on the risk clusters and the other clusters showed that the p-value was less than 0.05, confirming that the means of the two groups were significantly different for all the items in Fig. 8.
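For reference, the statistic behind Welch's test can be computed with the standard library alone. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom for two samples; the function name `welch_t` and the sample values are hypothetical and do not come from the study's data.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom.

    Unlike Student's t-test, the two groups are not assumed to share a variance.
    """
    # Sample variance of each group divided by its size.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up values standing in for one item in a risk cluster and in the rest.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Converting the statistic to a p-value requires the t distribution; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.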

Result and discussion
Turbulence prediction for validation data
The risk clusters described in the previous section were used to predict the occurrence of turbulence for 179 data points collected in 2019 for the items listed in Table 3. Following the normalization of these data, axis transformation was performed using the transformation matrix W described in the previous section.

Calculation of risk date using the risk cluster via SVC
The risk cluster was used as the training data to predict the turbulence dates for the 2019 data using SVC. Table 5 lists the validation data and SVC parameters. Figure 9 presents a comparison of the days that were predicted to exhibit turbulence with those that were not. It is evident that the distributions of wind speed and temperature are similar to those of the risk cluster. For fx106-03-500-shear, all values between 10/01/2019 and 12/31/2019 equaled 0. It was concluded that the days with predicted turbulence exhibited strong wind speeds, low temperatures, and large wind speed differences.
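The overall flow (normalize, project with PCA, cluster with k-means, label risk clusters, train SVC, predict on new data) can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the study's code: the data are random stand-ins with the reported shapes, the turbulence row indices are hypothetical, the rule "a cluster is a risk cluster if it contains an observed turbulence day" is a simplification, and the SVC parameters are placeholders rather than the values in Table 5.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(165, 45))  # synthetic stand-in for the 2017-2018 data
X_new = rng.normal(size=(179, 45))    # synthetic stand-in for the 2019 data
turbulence_rows = [10, 50, 120]       # hypothetical rows of observed turbulence

# Normalize, then keep enough PCs to explain at least 80% of the variance.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.80, svd_solver="full").fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))

# Cluster the transformed training data into six clusters.
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(Z_train)

# Assumption: clusters containing an observed turbulence day are risk clusters.
risk_ids = sorted(set(clusters[turbulence_rows].tolist()))
y_train = np.isin(clusters, risk_ids).astype(int)  # 1 = risk cluster member

# Train SVC on cluster-derived labels (placeholder parameters).
svc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(Z_train, y_train)

# Apply the SAME scaler and projection to the new data before predicting.
Z_new = pca.transform(scaler.transform(X_new))
risk_pred = svc.predict(Z_new)  # 1 = predicted turbulence-risk day
```

The key design point is that the scaler and the projection are fitted on the training period only and then reused unchanged for the new period, so that both data sets share the same coordinate system.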

Verification of forecasted turbulence dates via SVC
Through the use of a weather map, the days with the risk of turbulence predicted using the risk clusters and SVC were verified; the results are summarized in Table 6.
Turbulence risk was assigned on four levels, in increasing order of risk: 1 (normal), 2 (caution), 3 (warning), and 4 (critical), so that the results could be presented to airlines more easily. The highest risk was observed on January 9, 2019, when flight cancellations were considered. Moreover, even on the dates when the turbulence risk level was at least 2, passenger safety measures, if not flight cancellation, were seriously considered. Therefore, it was confirmed that this analysis can adequately predict turbulence-risk days. Figure 10 compares the per-minute average of the maximum standard deviations (SDs) of the vertical sway of the aircraft [37,38], obtained from actual QAR data, between the turbulence dates predicted via SVC and the other days. However, 02/04/2019 was excluded because QAR data could not be obtained for that date. From the left, the graph shows the average of the maximum SDs of the vertical sway, followed by the values for climb and descent. Further,

Comparison with other methods
The proposed method was compared with other methods. Table 7 shows the results of validating the data in Table 3 using 10-fold cross-validation (K = 10), where the records for the days on which no turbulence occurred were set as true, and the accuracy was the highest among the methods used. For all methods and models, the false negative (FN) item, which reflects the detection of days on which turbulence was observed, is 0, indicating that turbulence occurrence was not detected.
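Why high accuracy can coexist with no detected turbulence is easy to see with a small count. The sketch below uses the labeling described above (no-turbulence days as the positive class) and an illustrative baseline that always predicts "no turbulence"; this baseline is hypothetical and is not one of the compared methods, and it assumes one dataset row per day (165 rows, 3 turbulence days).

```python
from typing import Dict, Sequence


def confusion(y_true: Sequence[str], y_pred: Sequence[str],
              positive: str = "no_turbulence") -> Dict[str, int]:
    """Count TP/FP/TN/FN with `positive` treated as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# 165 days with 3 turbulence days, as in the dataset described earlier.
y_true = ["no_turbulence"] * 162 + ["turbulence"] * 3
# Hypothetical baseline: never predicts turbulence.
y_pred = ["no_turbulence"] * 165

c = confusion(y_true, y_pred)
accuracy = (c["TP"] + c["TN"]) / len(y_true)
```

Here the accuracy is about 98% even though no day is ever flagged as turbulent, which is why accuracy alone is a weak criterion for rare events.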

Conclusion
This study used open data to predict the occurrence of turbulence to render aircraft operations safer and more comfortable. Although turbulence occurs infrequently, it is a leading cause of aircraft damage and changes in flight schedules. The findings of this study are twofold. First, following the confirmation of the statistical information using the risk clusters, they were used as supervisory data to make appropriate predictions even for low frequency events such as turbulence. Moreover, the turbulence-risk cluster was derived through k-means clustering after reducing the dimensions of available data via PCA, instead of using the rare instances of turbulence as the training data. In addition, the process of creating risk clusters provided an opportunity to examine the factors that influenced turbulence occurrence. In the case of high-risk events such as aircraft operations, this can have a synergistic effect with the experience and knowledge of the pilots themselves. Further, using this turbulence-risk cluster as training data, the turbulence occurrences for 2019 were predicted through SVC, with the obtained results being confirmed to be sufficiently accurate for utilization by pilots. Second, it was found that using open data, the prediction of turbulence occurrence was possible. Further, the meteorological data used in this study is routinely used by pilots and airlines, and thus can be used at airports other than the one covered in this study.