A Machine Learning Approach to Safer Airplane Landings: Predicting Runway Conditions using Weather and Flight Data

The presence of snow and ice on runway surfaces reduces the available tire-pavement friction needed for retardation and directional control and poses potential economic and safety threats for the aviation industry during the winter seasons. To activate appropriate safety procedures, pilots need accurate and timely information on the actual runway surface conditions. In this study, XGBoost is used to create a combined runway assessment system, which includes a classification model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained on weather data and data from runway reports. The runway surface conditions are represented by the tire-pavement friction coefficient, which is estimated from flight sensor data from landing aircraft. To evaluate their performance, the models are compared to several state-of-the-art runway assessment methods. The XGBoost models identify slippery runway conditions with a ROC AUC of 0.95, predict the friction coefficient with a MAE of 0.0254, and outperform all the previous methods. The results show the strong ability of machine learning methods to model complex physical phenomena with good accuracy when domain knowledge is used in the variable extraction. The XGBoost models are combined with SHAP (SHapley Additive exPlanations) approximations to provide a comprehensible decision support system for airport operators and pilots, which can contribute to safer and more economic operation of airport runways.


Introduction
Contamination of runway surfaces with snow, ice or slush poses potential economic and safety threats for the aviation industry during the winter seasons. The presence of these materials reduces the available tire-pavement friction needed for retardation and directional control, which can lead to accidents and loss of human lives [16,33]. During 2019, seven commercial passenger aircraft overran the runway during landing in the United States due to bad weather and runway conditions [13]. Difficult landing conditions are not only a problem at northern airports. On 7 August 2020, an aircraft suffered a runway excursion in India during poor weather conditions, and both pilots and 19 passengers died in the accident. Difficult weather conditions such as snow and rain also contribute to the growing number of delayed and cancelled flights [63].
If the aviation industry returns to the growth trajectory it had before COVID-19, the global air transport demand is expected to triple by the year 2050 [17], which will increase the need for more efficient and safer operation of airport runways. This has led the global aviation industry to work towards more standardized and data-driven assessment of runway conditions [36], which pilots need to activate appropriate safety procedures when landing and at take-off. Information about the available friction on the runways is given to pilots in internationally standardized runway reports. Unfortunately, the accuracy of runway reports can sometimes be unsatisfactory, and measuring the runway friction with an acceptable precision is difficult [46,5]. While many different friction measurement devices have been developed, it is hard to find equipment that produces stable and consistent results which correspond to the braking friction experienced by landing aircraft [51,52]. Another problem is that in order to measure friction, the runway must be closed for traffic, so such measurements cannot be carried out very frequently. Heavy snowfall or sudden drops in temperature, in particular, may result in rapidly changing conditions. As a result, the runway reports are not as useful as one could hope.
There have been several studies that attempt to make the measurements from friction measurement devices more useful by relating the ground friction measurements to the aircraft braking friction using correlations or adjustments [53,7,26]. However, the inconsistency and variance between friction measurement devices and different airports is a problem. Other studies measure the surface conditions using other measurement techniques [29], including acoustic [4,34], tread [12,45] and optical [19,60] methods. In the development of better anti-skid brake systems, there have also been studies on using sensor data from landing aircraft, such as wheel speed and brake force, to provide real-time estimates of the available braking friction force [23,24,38]. The problem is that all these sensor methods depend on real-time measurements from sensors attached to the aircraft, which measure the relevant parameters as the vehicle challenges the friction. Pilots need to know the surface conditions prior to landing, so these methods are not useful in this case.
In order to predict the surface conditions before aircraft hit the ground, studies have been conducted on how the available surface friction is affected by weather conditions and runway contamination, based on engineering- and physics-based models and basic statistical approaches. Klein-Paste et al. [32] proposed a runway model for the surface conditions which interprets descriptive data from the internationally standardized runway reports called Snowtam reports. The model evaluates a sum of seven effects that contain information about the runway contamination as well as measurements of runway temperature and humidity, P = Σ_{i=1}^{7} x_i. The first effect x_1 sets the main assessment of the runway conditions in the interval [1, 5] by evaluating the form of contamination on the runway. The assessment is then either upgraded or downgraded by the next six effects, which take values in the range [−2, 2]: the spatial coverage x_2, the depth of contamination x_3, the runway temperature x_4, the humidity x_5, and the use of chemicals x_6 and sanding x_7. The output of the model is a prediction P of the runway braking action, which is the international format for specifying runway conditions and is described in Table 1. When P exceeds 5 it is set to 5, and when it is below 1 it is set to 1.
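The scoring rule above can be sketched as follows (a minimal sketch; the function name and the example effect values are ours, only the summation and clamping follow the description of the model):

```python
# Sketch of the runway model's scoring rule. effects = [x1, ..., x7],
# where x1 in [1, 5] is the main assessment from the contamination
# form and x2..x7 in [-2, 2] are the upgrade/downgrade effects
# (coverage, depth, runway temperature, humidity, chemicals, sanding).
def runway_assessment(effects):
    P = sum(effects)           # P = sum_{i=1}^{7} x_i
    return max(1, min(5, P))   # clamp the braking action to [1, 5]

# Example: main assessment 4, downgraded by runway temperature (-1)
# and humidity (-1), upgraded by sanding (+1): braking action 3.
runway_assessment([4, 0, 0, -1, -1, 0, 1])
```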
Juga et al. 2013 [27] predicted surface friction on traffic roads using linear regression models with weather data, which can be partially related to the surface friction on airport runways. The models predict the friction coefficients CF_si, CF_w and CF_d for snowy/icy, wet and dry surfaces as linear functions of the road surface temperature T_r and the thicknesses X_S, X_I and X_W of the snow, ice and water layers, with regression coefficients a_i, b_i, c_i and d_i (i ∈ {1, 2}). The model is used in the road weather model RoadSurf in Finland, which simulates road surface temperatures, conditions and friction coefficients to assist traffic safety and winter road maintenance [28]. Another study on surface friction on traffic roads was done by Kim et al. [31], who predicted the friction coefficient on roads during rainy weather. This was done with an artificial neural network using rainfall intensity, water film thickness, and road surface temperature as input variables and the road friction coefficient as response.
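A model of this linear form can be sketched as follows (the functional form and variable names follow the description above, but the coefficient values are illustrative placeholders, not the fitted values from [27]):

```python
# Sketch of a linear friction model of the Juga et al. type for
# snowy/icy surfaces. The coefficients a, b, c, d are made-up
# placeholders for illustration only.
def cf_snow_ice(T_r, X_S, X_I, a=0.6, b=0.01, c=-0.05, d=-0.1):
    """T_r: road surface temperature; X_S, X_I: thickness of the
    snow and ice layers."""
    cf = a + b * T_r + c * X_S + d * X_I
    return max(0.0, min(1.0, cf))  # keep the coefficient physical

cf_snow_ice(T_r=-2.0, X_S=1.0, X_I=0.5)  # ~0.48
```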
Huseby & Rabbe 2012 [20] introduced a scenario-based model for assessing airport runway conditions using weather data, which defines a set of scenarios known to cause slippery conditions. By monitoring the meteorological parameters runway temperature, air temperature, relative humidity, horizontal visibility, and precipitation type and intensity, the model detects slippery scenarios. As an example, the scenario SNOW is one that occurs quite often at northern airports, and its conditions include:

• pt_i ∈ {snow, sleet, drifting snow} at least once, i ∈ I_4,

combined with threshold conditions on the air temperature ta_i, the runway temperature tr_i, and the relative humidity hu_i, where pt_i is the precipitation type at time i and I_4 is a time slot containing the last four hours from the given point of time. The scenario model was further developed in Huseby & Rabbe [21], where it was shown that the scenario model could be improved by optimizing the thresholds for the scenarios with respect to Type 1 and Type 2 errors using weather and flight data. Both Huseby & Rabbe's [20] scenario model and the runway model of Klein-Paste et al. [32] are used in an integrated runway information system called IRIS, which is implemented at 16 Norwegian airports to support safer operation of Norwegian airports.

The complexity and non-linearity of the physical relationships controlling the surface friction, and their dependency on each other through time, make it difficult to provide precise physical models of the runway surface friction. Machine learning methods have on several occasions been shown to be able to model complex physical phenomena with good accuracy. The main objective of this paper is to study whether machine learning methods can predict runway surface conditions with higher precision than previous methods.
This is done by using XGBoost to create a combined system of a classification model and a regression model trained on weather data, data from runway reports, and sensor data from landing aircraft. As with most tree ensemble methods, XGBoost with deep decision trees creates highly complex models and is regarded as a black-box algorithm. Therefore, we use SHAP to create simplified, understandable models that provide both global and local explanations of the models' predictions. All the models are combined into a data-driven decision support system, which can aid airport operators and pilots in their decisions and contribute to safer and more economic operation of airports. To evaluate the performance of the XGBoost models, they are compared to state-of-the-art runway surface condition assessment models, namely the runway model of Klein-Paste et al. [32] and the scenario model of Huseby and Rabbe [20]. The models are also compared to the runway assessments of airport runway inspectors reported in the Snowtam reports.

The rest of the paper is structured as follows: In Section 2 we briefly describe the data and sources used in this work, and how the response variable and explanatory variables are extracted. In Section 3 we describe XGBoost and SHAP, the methodology used to create the models. In Section 4 we evaluate the performance of the models and compare them to the runway model, the scenario model and the assessments from runway inspectors. We also describe the XGBoost models by using SHAP values to create global explanations. In Section 5 we introduce the decision support system, which uses output from the XGBoost models together with local explanations of the predictions. In Section 6 we sum up the work and add our concluding remarks and future work.
2 Data sources and variable extraction

Data sources
All data used in this study are made available by Avinor, the largest airport operator in Norway. The full data set includes weather data, runway reports and flight data from 16 Norwegian airports. There are significant differences between these airports with respect to weather conditions, maintenance procedures, runway lengths, and traffic. To avoid possible effects of these differences on the analysis, separate models should be fitted for each airport. This study focuses on data from Oslo Airport, Norway's largest airport.
The weather data comes from measurement devices at the airport, which measure meteorological variables such as wind speed, temperature, humidity, and precipitation every minute.
The runway reports, called Snowtam reports, are created by the airport operators and include descriptive information about runway contamination, such as its type and depth, and about maintenance procedures, such as sanding or the use of chemicals. A new Snowtam report must be issued at least every 24 hours, or whenever the runway conditions change significantly.
The flight data, collected over ten winter seasons from season 2009/2010 to season 2018/2019, is provided by Scandinavian Airlines Service (SAS) and Norwegian Air Shuttle AS and is gathered from the Quick Access Recorder (QAR) of Boeing 737-600/700/800 NG airplanes. The flight data for Oslo Airport consists of 200 508 landings. The flight data is used to estimate the friction coefficient and calculate whether a landing is friction limited or not, as described in Section 2.2. For each landing, the data consists of 60 seconds of measurements such as acceleration, brake pressure, flap position, and engine thrust starting from touch down.

Calculating the response variable
To reflect the airport runway conditions, the aircraft braking coefficient, µ_B, is calculated using a performance model developed by Boeing. µ_B is defined as the ratio between the tangential force needed to maintain uniform relative motion between the aircraft's wheels and the runway surface and the normal force pressing the wheels onto the surface. The calculations are based on the equation of motion of a moving vehicle,

m (dv/dt) = D_thrust + D_aero + m g sin(α) + D_brakes,

where m is the weight of the aircraft, dv/dt is the acceleration, D_thrust is the force caused by thrust, D_aero is the aerodynamic drag, g is the gravitation, α is the slope of the runway, and D_brakes is the force contribution from the wheels. The contributions of D_thrust, D_aero, and D_brakes can be calculated using aircraft-type specific performance models, and µ_B can then be retrieved from

D_brakes = µ_B (m g cos(α) − L),

where L is the aerodynamic lift. At Oslo Airport, α was set to zero as the contribution to retardation due to slope was negligible. More details about calculating the braking coefficient can be found in Midtfjord & Huseby [44] and Klein-Paste et al. [33].

An important problem when analysing flight data is deciding whether a landing is friction limited or not. Unless the pilot challenges the runway friction during the landing by fully applying the brakes, the maximum friction available will not be utilized. In this case, µ_B reflects the amount of tire-pavement friction that was used. When the wheel brakes are applied fully or to a high degree on slippery runways, the maximum attainable friction from the runway is used during the stop. In this case, the aircraft's deceleration is limited by the friction available from the runway, and the obtained µ_B will reflect the amount of tire-pavement friction that is available.
To figure out if the brakes are applied fully during a landing, we check whether the brake pressure "requested" by the pilot exceeds the brake pressure estimated from the measured deceleration. Whenever this occurs, the anti-skid system is activated, and all the available friction is used. If these conditions last for at least 3 successive seconds, the landing is said to be friction limited. Since the braking coefficient then reflects the available tire-pavement friction, we refer to it as the friction coefficient.
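The friction-limited test can be sketched as follows (a minimal sketch; the function and variable names are ours, the 3-second rule follows the description above):

```python
# A landing is friction limited if the brake pressure "requested"
# by the pilot exceeds the pressure estimated from the measured
# deceleration for at least 3 successive seconds.
def is_friction_limited(requested, estimated, min_seconds=3):
    """requested, estimated: per-second brake pressure series."""
    run = 0
    for req, est in zip(requested, estimated):
        run = run + 1 if req > est else 0  # anti-skid active
        if run >= min_seconds:
            return True
    return False

is_friction_limited([5, 9, 9, 9, 5], [6, 6, 6, 6, 6])  # True
```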
In the first part of our system, we want to classify whether a landing is slippery or not, i.e. we want a warning when the runway conditions are not good. If a landing is friction limited, this indicates that the runway conditions may not be optimal. However, this does not necessarily imply that the runway conditions are bad. The friction coefficients can be converted to the corresponding braking actions by using Table 1. Landings which are medium-good or good, i.e., where µ_B > 0.15, are considered non-slippery, even if they are friction limited. Table 2 shows the distribution of the landings at Oslo Airport during the winter seasons 2009/2010 until 2018/2019; 5 163 of the 200 508 landings are slippery, which is only 2.57% of the landings.
In the second part of the system, we want to predict how slippery it was during the slippery conditions. This is done by applying an XGBoost regression algorithm to the friction coefficients for the friction limited landings, and then converting the predicted friction coefficients to braking actions according to Table 1.

Extracting the explanatory variables
The effect weather has on the runway surface conditions is complex, as it is highly dependent on the interaction between multiple weather variables over time, as well as on the maintenance of the runways. It is not enough to simply consider the present weather; it is also necessary to know how the weather has developed over the preceding hours and what kind of maintenance operations have been carried out on the runway in the meantime.
One way to include both some information about maintenance operations on the runway and the weather development some time back from the present is to include data from the Snowtam reports in the variables. The reports include information about the maintenance actions sanding and the use of de-icing or anti-icing chemicals on the runways. The reports also contain information about runway contamination such as snow, rime, or ice, as well as the depth and coverage of the contamination. By using the reports, it is possible to gather knowledge about past precipitation and temperature development. However, since the Snowtam reports are issued only one to a few times per day, they do not provide information about rapid changes. Therefore, real-time information about the weather development backward in time should be drawn from measurements of meteorological variables in addition to the data from the Snowtam reports. This was done using the following observations:

• pi = Precipitation intensity
• pt = Precipitation type
• ta = Air temperature
• tr = Runway temperature
• hu = Relative humidity
• vi = Horizontal visibility
• ap = Air pressure
• dp = Dew point

After some explorative analysis, it was decided to include measurements of several weather variables not only at the present time, but also k ∈ {1, 3, 6, 12, 24} hours back in time, where x_{i,k} denotes variable x at k hours before time i and x ∈ {pi, ta, tr, hu, vi, ap, dp}. To include the trend of some relevant variables over time, we take the difference between the present value and the value k hours back in time, ∆_k x_i = x_i − x_{i,k}, where x ∈ {tr, hu, ap}. These variables were chosen as their trend might affect the surface conditions, especially when large changes occur. In addition, precipitation over time was included by accumulating the intensity for each precipitation type over the interval between times i − k and i, where pt ∈ {rain, sleet, wet snow, dry snow} and only the times j at which the precipitation is of type pt_j = pt contribute to the sum.
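The three types of derived variables can be sketched on hourly series as follows (a minimal sketch; the function names and example values are ours):

```python
# Hourly weather series are stored as lists indexed by hour.
def lagged(x, i, k):
    """x_{i,k}: value of variable x, k hours before time i."""
    return x[i - k]

def trend(x, i, k):
    """Delta_k x_i = x_i - x_{i,k}: change over the last k hours."""
    return x[i] - x[i - k]

def accumulated(pi, pt, i, k, ptype):
    """Accumulated intensity of precipitation of type ptype
    between times i-k and i."""
    return sum(pi[j] for j in range(i - k, i + 1) if pt[j] == ptype)

tr = [0.0, -1.0, -2.0, -2.5]                   # runway temperature
pi = [0.0, 0.4, 0.8, 0.2]                      # precip. intensity
pt = ["none", "dry snow", "dry snow", "rain"]  # precip. type
trend(tr, 3, 3)                        # -2.5: cooling over 3 hours
accumulated(pi, pt, 3, 3, "dry snow")  # ~1.2: dry snow accumulated
```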
In addition to the variables mentioned above, present measurements of the along wind and across wind were also included in the explanatory variables. We have also included the absolute values of the air temperature and the runway temperature, as temperatures close to zero can lead to difficult runway conditions, independent of the sign.
Another challenge when working with weather data and runway reports is the categorical variables. Especially the contamination type in the Snowtam reports has a complex setup; it consists of nine different contamination codes given in Table 3, where the final category can be a combination of several layers. As an example, the contamination code 479 means Dry snow on Ice on Frozen ruts or ridges. The multiple layers consist of at most one "loose layer" and at most two "solid layers". One way to make these combinations more useful is to create groups of contamination codes. Klein-Paste et al. [32] divided the different combinations of contamination codes into six groups based on their slippery characteristics, and used the groups in further calculations in the runway model:

• Loose and dry
• Contaminated
• Solid base layer

One combination of contamination codes can occur in several of the groups. Another way to decrease the number of possible combinations is to narrow the reports down to only two layers, which is the approach the international format for specifying runway conditions is going to take in the future [55].
One benefit of XGBoost, the machine learning algorithm used to train the runway surface condition predictor in this work, is that it handles sparse data well, as it uses a sparsity-aware split finding algorithm [9]. Therefore, it is possible to enter the complex categorical variable contamination type as several one-hot encoded variables, one for each possible combination of contamination codes. One-hot encoding transforms the original variable with N possible states into N binary variables, one for each possible state. The variable contamination type was transformed into 30 one-hot encoded variables. The same one-hot encoding was done for the weather variable precipitation type, which is also a categorical variable, with nine categories. The final matrix of explanatory variables consisted of 151 variables.

eXtreme Gradient Boosting
In this paper, we build prediction models using the state-of-the-art boosting algorithm XGBoost [9] to predict runway conditions from weather data and runway reports. XGBoost stands for eXtreme Gradient Boosting and is a scalable implementation of gradient boosted decision trees [14]. Since its release in 2014, XGBoost has been a very popular machine learning method, and it has a highly impressive winning record in machine learning competitions. XGBoost has already been used in several transportation risk assessment applications within road traffic [48,58,57], aviation [39], and shipping [18,25,3].
XGBoost is a supervised learning method: it derives a model f(x) that relates m input variables x to an outcome of interest y. This is done by minimizing a loss function L(y, f(x)) that penalizes differences between y and f(x). As a boosting approach, XGBoost does not minimize the loss function at once, but in small steps. This is done by iteratively fitting a weak learner, in this case a penalized version of a decision tree, to the gradient of the loss computed at the previous iteration. The final model estimate f̂(x) has the form

f̂(x) = Σ_{k=1}^{K} f_k(x),

where f_k(x) is the decision tree computed at the k-th iteration. In contrast to other ensemble methods like bagging and random forests, a boosting algorithm learns from the results of the previous iterations. In this way, the algorithm can focus on the most interesting data structures, and the space of possible models is better explored.
In practice, the model must be learned from the data, which in general consist of an n (number of observations) times m (number of variables) matrix of inputs X and an n-dimensional vector of outcomes y. At each iteration, a decision tree f_k(x) is derived by minimizing the objective function

Obj^(k) = Σ_{i=1}^{n} L(y_i, f̂^(k−1)(x_i) + f_k(x_i)) + Ω(f_k(x)),

where f̂^(k−1)(x) is the current estimate of the model (i.e., the model computed at the previous iteration k − 1), and Ω(f_k(x)) is a penalty term that penalizes the tree complexity.
Basically, at each iteration k, XGBoost looks for the tree f_k(x) that minimizes this objective. Due to the boosting requirement of a weak learner, the optimization is constrained by Ω(f_k(x)), such that simple trees are favoured. Once the best tree f_k(x) is obtained, its contribution is added to the current model,

f̂^(k)(x) = f̂^(k−1)(x) + ν f_k(x).

Note that the k-th contribution to the final model is shrunk by a factor ν (step size shrinkage), which reduces the convergence speed and therefore fulfils the boosting requirement of making only a small improvement to the model at each iteration.

Part of the success of XGBoost lies in its clever way of performing the optimization above. Instead of working directly with the objective function, the optimization is performed on its second order approximation

Obj^(k) ≈ Σ_{i=1}^{n} [g_i f_k(x_i) + ½ h_i f_k(x_i)²] + Ω(f_k(x)),

where g_i and h_i are the first and second order derivatives of the loss with respect to the current estimate f̂^(k−1)(x_i). The key point is that the construction of the decision trees, namely the identification of the split points and the leaf weights, depends on the loss function only through these two gradient terms, which makes the computations easier. The explicit form of the penalty term,

Ω(f_k(x)) = γT + ½ λ ||w||²,

also helps the computations, as it associates a penalty parameter γ to the computation of the split points and a penalty parameter λ to that of the leaf weights. The former penalizes the number of tree leaves T, the latter the magnitude of the leaf weights w, with ||·|| denoting the L2 norm.

Another relevant feature implemented in XGBoost is data subsampling. In order to prevent overfitting, i.e. training too complex models that incorrectly treat random noise as important structure, only a random subset of the n observations is used in each tree fitting step. As a convenient consequence, the computations are also sped up. More details on XGBoost can be found in Chen & Guestrin 2016 [9].
The general framework of XGBoost works for any kind of response variable, provided that a suitable loss function is implemented. The first part of our system deals with a binary classification problem (slippery / non-slippery), so we use the logistic loss function

L(y, f̂) = −Σ_{i=1}^{n} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],

where y_i is the true class of observation i and ŷ_i is the predicted probability that instance i is of class 1, calculated as ŷ_i = 1 / (1 + e^{−f̂(x_i)}). In the second part of our system we have a continuous regression on the friction coefficient, so we use the squared error loss

L(y, f̂) = Σ_{i=1}^{n} (y_i − ŷ_i)²,

where y_i is the true friction coefficient for instance i and ŷ_i = f̂(x_i) is the predicted friction coefficient.
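The two loss functions can be written out per observation as follows (a pure-Python sketch; in the XGBoost library these correspond to the built-in objectives `binary:logistic` and `reg:squarederror`):

```python
import math

def sigmoid(f):
    """Maps the raw model output f(x_i) to a probability y_hat_i."""
    return 1.0 / (1.0 + math.exp(-f))

def logistic_loss(y, f):
    """Classification: y in {0, 1}, f is the raw model output."""
    y_hat = sigmoid(f)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def squared_error(y, y_hat):
    """Regression: y and y_hat are friction coefficients."""
    return (y - y_hat) ** 2

logistic_loss(1, 0.0)      # log(2) ~ 0.693: maximally uncertain
squared_error(0.10, 0.12)  # ~4e-04
```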
The excellent performance of XGBoost, its scalability, and its fast computations are among the reasons why it was chosen to train the surface condition predictor in this work. In addition, as XGBoost is an ensemble of decision trees, its performance is not affected by multicollinearity (highly correlated explanatory variables) [50], which is highly present in our data, especially between the variables created in Section 2.3, which are different time variants of the same variable or functions of other existing variables, such as ∆_k x_i.

Parameter tuning and model evaluation
When working with machine learning methods, parameter tuning is an important part of training the models. For example, finding good values for the penalties λ and γ is important both to prevent overfitting, which happens when λ and γ are too small, and underfitting, which happens when they are too large. Underfitting means training too simple models that fail to capture the data structures.
Model fitting, parameter tuning and model evaluation must be performed on different data. In this paper, we use ten-fold nested cross validation, a method for model training, tuning and evaluation that has been shown to provide an approximately unbiased estimate of the true model error [61]. The data are divided into ten folds, which are used in turn as a test set to evaluate the model trained on the other nine folds. The mean of the evaluation measure obtained over the ten test folds is regarded as the performance of the model.
In each of the ten repetitions, the collected data from the nine training folds are again divided into three folds to pursue a cross-validated randomized search for tuning the parameters of the XGBoost model. The model is trained on two parts of the data with different combinations of parameters and evaluated on the third, and this is repeated for all three folds. The parameters that give the best mean performance over all three folds are chosen. Five parameters are tuned with four settings each, where the settings are sampled from the distributions of possible values shown in Table 4. This means that a total of 20 random combinations of parameters are evaluated. n_estimators is the number of decision trees in the model, which we denoted K in the equations in Section 3.1. reg_lambda and min_split_loss are the regularization parameters λ and γ, respectively. subsample is the ratio of the data used in the data subsampling mentioned earlier. Finally, learning_rate is ν, the step size shrinkage used at each boosting step.
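The nested cross-validation procedure can be sketched as follows (a pure-Python skeleton; `fit_score` is a stand-in for training and scoring an XGBoost model, and the candidate parameter combinations would in practice be sampled from the distributions in Table 4):

```python
def k_folds(indices, k):
    """Split a list of indices into k disjoint folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(indices, candidates, fit_score, outer_k=10, inner_k=3):
    """candidates: parameter combinations to try.
    fit_score(params, train, test): trains on train, scores on test."""
    outer = k_folds(indices, outer_k)
    outer_scores = []
    for test_fold in outer:
        train = [j for f in outer if f is not test_fold for j in f]
        inner = k_folds(train, inner_k)

        def cv_score(params):
            # Mean score over the inner validation folds (tuning).
            return sum(
                fit_score(params,
                          [j for f in inner if f is not val for j in f],
                          val)
                for val in inner) / len(inner)

        best = max(candidates, key=cv_score)   # inner loop: tuning
        outer_scores.append(fit_score(best, train, test_fold))
    return sum(outer_scores) / len(outer)      # outer loop: error estimate
```

Because the parameters are chosen on the inner folds only, the outer test folds never influence the tuning, which is what makes the outer error estimate approximately unbiased.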

Shapley Additive Explanations
The models created by XGBoost become quite complex, as they combine scores from between 50 and 250 decision trees, which makes it difficult to understand how they make their predictions. The increased use of black-box algorithms such as XGBoost and deep neural networks has increased the need for explainable artificial intelligence [2]. This involves methods for creating simpler explanation models, which are interpretable approximations of the complex black-box models. There are many reasons why it is important to have some understanding of how a system works, including gaining trust in the system, gaining insight into how the system could be improved, allowing us to learn from the system, and monitoring possible errors in the data or models.

One method to gain insight into the decision basis of a machine learning system is SHAP (SHapley Additive exPlanations), the state-of-the-art method for creating local explanations of machine learning models [42]. Local explanations explain why a specific instance got its prediction, which SHAP does by using Shapley values from cooperative game theory [56]. The variables are the players in the game, while the game is to predict whether the runway conditions are slippery, or how slippery in the case of the regression model. The goal of using Shapley values is to distribute the prediction among the variables. This makes Shapley values part of the additive feature attribution methods, which have an explanation model that is a linear function of binary variables,

g(z) = φ_0 + Σ_{j=1}^{M} φ_j z_j,

where z ∈ {0, 1}^M is a coalition vector giving the absence / presence of the input variables in x and M is the number of variables in the original model. Methods with this explanation model assign an importance effect φ_j to each variable, and summing the effects of all variables approximates the output of the original model. Several popular local explanation methods share this additive feature attribution form, such as LIME [54], DeepLIFT [59], and Layer-Wise Relevance Propagation [6].
The Shapley value of variable j for a model f(x) on instance i is calculated as

φ_j = Σ_{S ⊆ {1,…,M}\{j}} [|S|! (M − |S| − 1)! / M!] (f_{S∪{j}}(x_{S∪{j}}) − f_S(x_S)),

where S is the set of non-zero indexes in z and f_S(x_S) is the model output when only the variables in S are present. This approximation of f_S(x_S) is done to account for the missing values in x_S. SHAP values are theoretically optimal and are, according to Lundberg & Lee [42], the only possible consistent feature attribution method. But like many theoretical optima, they can be difficult to calculate. Therefore, Lundberg et al. [41,40] derived an algorithm specific to tree ensembles that reduces the complexity of computing exact SHAP values for this kind of model structure. The algorithm is called Tree SHAP and is the explanation method used to explain the XGBoost models in this work.
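For a toy value function, the Shapley values can be computed by brute force directly from this weighted-average formula (an illustrative sketch, not the Tree SHAP algorithm; v(S) plays the role of f_S(x_S), and the toy model is ours):

```python
from itertools import combinations
from math import factorial

def shapley_values(M, v):
    """M: number of variables; v: value function on subsets of
    {0, ..., M-1}, where v(S) plays the role of f_S(x_S)."""
    phi = []
    for j in range(M):
        others = [k for k in range(M) if k != j]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = set(S)
                weight = (factorial(len(S)) * factorial(M - len(S) - 1)
                          / factorial(M))
                total += weight * (v(S | {j}) - v(S))  # marginal effect
        phi.append(total)
    return phi

# Toy additive model f(x) = 2*x_0 + 1*x_1 evaluated at x = (1, 1),
# with absent variables replaced by a background value of 0.
v = lambda S: 2.0 * (0 in S) + 1.0 * (1 in S)
shapley_values(2, v)  # [2.0, 1.0]; the effects sum to v(all) - v({})
```

The brute force runs over all 2^M subsets per variable, which is exactly the cost Tree SHAP avoids for tree ensembles.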
As we are interested in knowing how our models work, we use the interventional approach to handle correlated variables. This means that we intervene on variables to break the dependencies between dependent variables according to the rules of causal inference [22]. In practice, this is done by approximating f_S(x_S) with E[f(x) | do(X_S = x_S)], where do is Pearl's [49] do-operator. This operator simulates physical interventions by replacing certain functions or values in a model with a constant X = x, while keeping the rest of the model unchanged. The effect of this is that our explanations become true to the model instead of true to the data, which is further discussed in Chen et al. [8].

Performance of the classification model
As the dataset is highly imbalanced, with only 2.57% slippery landings, accuracy is not a good performance evaluation metric for the binary classification. Instead, the XGBoost classification model is evaluated using confusion matrices, which show the number of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) predictions, where slippery is regarded as positive and non-slippery as negative. The first two columns in Table 5 show the confusion matrix for the predictions from the XGBoost classification model, where the columns are the predicted classes and the rows are the actual classes. As described in Section 3.1, the output from the classification model is a probability and not a binary classification, so the predictions are converted to binary classifications by using a threshold value for the probabilities. To account for the unbalanced dataset, the threshold was set to 0.0257, the proportion of slippery landings in the data.

Table 6: Results from the prediction of slippery conditions from the different classification methods, where the highest and lowest value in every row is marked in green and red.
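The thresholding and the resulting confusion matrix counts can be sketched as follows (a minimal sketch; the function name and example data are ours):

```python
# Convert predicted probabilities to binary classes with a threshold
# and count the confusion matrix entries (slippery = positive = 1).
def confusion_matrix(y_true, y_prob, threshold):
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(y_true, y_prob):
        pred = int(p >= threshold)  # 1 = slippery warning
        if pred == 1 and y == 1:
            counts["TP"] += 1
        elif pred == 1 and y == 0:
            counts["FP"] += 1
        elif pred == 0 and y == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

# A low threshold (here the slippery class proportion, 0.0257)
# trades extra false positives for fewer missed slippery landings.
confusion_matrix([1, 0, 1, 0], [0.40, 0.01, 0.05, 0.30], 0.0257)
# {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 0}
```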

To evaluate the performance of the XGBoost model, it is compared to the predictions from the runway model and the scenario model explained in Section 1, as well as the reported surface condition assessments in the Snowtam reports made by runway inspectors. The runway model is mainly implemented according to the paper by Klein-Paste et al. [32], but includes the latest updates according to the operational IRIS system. One additional change has been carried out: the removal of the rule that contamination coverage less than 10% automatically gives a braking action of 5, as we did not have a stable data source for this variable. As mentioned earlier, both the runway model and the Snowtam reports provide the surface conditions on a scale from 1-5. In Table 5 we regard these methods as reporting slippery if the braking action is in the interval 1-3, meaning medium or less. The scenario model is implemented according to the paper by Huseby & Rabbe [20] and is set to report slippery if it gives any warning of a slippery scenario.
One observation from Table 5 is that the models have different strengths and weaknesses. Since the scenario model is created to be a warning system, it focuses on identifying most of the slippery landings, even though some false warnings might occur. As a result, the scenario model gives a high number of true slippery incidents, but is the method that misses the most on the non-slippery incidents. The runway model is more conservative than the scenario model. It gives a higher number of true non-slippery incidents, but misses the most on the slippery incidents. This contradictory behavior could stem from the motivation behind the models, as they were initially created to be two parts of the same runway assessment system that complement each other. The assessment from the runway inspectors is the most conservative prediction, and is the method that gives the highest number of true non-slippery landings. One reason why these assessments give more conservative predictions could be the infrequency of their updates. Good conditions are often more stable and can last for longer times, while difficult conditions can come and go more rapidly. The XGBoost model is optimized with the intention to balance the number of true slippery and true non-slippery landings in the optimal way, and is the method that gives the highest number of true slippery landings, while at the same time giving a high number of true non-slippery landings.
To see the difference in performance between the four methods more clearly, we use some commonly used performance evaluation metrics for imbalanced datasets based on the confusion matrices:

• Sensitivity: Sensitivity = TP / (TP + FN) is the ratio of true positive predictions to the total number of actual positive incidents, and gives the percentage of slippery incidents that were classified correctly.

• Specificity: Specificity = TN / (TN + FP) is the ratio of true negative predictions to the total number of actual negative incidents, and gives the percentage of non-slippery incidents that were classified correctly.

• G-Mean: The geometric mean √(Sensitivity × Specificity) is a combined metric that balances the sensitivity and the specificity.
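The three metrics above can be computed directly from confusion-matrix counts; the counts below are hypothetical, not the paper's actual values:

```python
from math import sqrt

def imbalance_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and G-mean from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of slippery landings caught
    specificity = tn / (tn + fp)   # share of non-slippery landings caught
    g_mean = sqrt(sensitivity * specificity)
    return sensitivity, specificity, g_mean

# Hypothetical counts for illustration (slippery = positive class).
sens, spec, g = imbalance_metrics(tp=92, fn=8, tn=850, fp=150)
# sens = 0.92, spec = 0.85
```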
The results for the prediction models and reported runway assessments in terms of these performance evaluation metrics are given in Table 6. We see that XGBoost outperforms all the other methods in the share of correctly classified slippery incidents with 92% sensitivity, while the runway model has the lowest sensitivity at 76%. XGBoost also outperforms the other two prediction models on correctly classifying the non-slippery landings, namely 85% of these, compared to 60% for the scenario model. But the conservative assessments from the runway inspectors correctly classify more of the non-slippery landings, namely 89%. The overall G-mean is still better for the XGBoost model, which has an improvement of 19% in true slippery incidents with only a 5% loss in true non-slippery incidents compared to the runway inspectors.
There is always a tradeoff between False Negatives (Type II errors) and False Positives (Type I errors), so one should consider the consequences of making the two types of errors. As the models developed in this setting are primarily meant to work as warning systems, giving warnings when there might be slippery conditions, there is no doubt that avoiding False Negatives is the most important factor. If a pilot is not warned about actual bad runway conditions (a False Negative), accidents may happen. On the other hand, a warning system that gives too many warnings (False Positives) might not be taken seriously. One benefit of the XGBoost model is that it predicts the probability of a landing being slippery. Using these probabilities, the user can decide the threshold value for landings to be regarded as slippery, thus altering the probability of the system making the two different types of errors. A visualization of this is the Receiver Operating Characteristics (ROC) curve, which plots the sensitivity (also called the True Positive Rate, TPR) vs. 1 − specificity (also called the False Positive Rate, FPR) for different threshold values. A plot of the ROC curve for the XGBoost model is given in Figure 1, which shows that allowing a higher FPR provides a higher TPR. The closer the points/line is to the upper left corner, the better the performance.
A metric for measuring model performance using ROC curves is the area under the curve (AUC). An area of 1 gives a perfect prediction, while an area of 0.5 (the area under the red dotted line in Fig. 1) corresponds to a model as bad as random guessing. The XGBoost model achieves an area of 0.949, a high performance close to 1. The standard deviation in ROC AUC for the ten folds was 0.005, meaning we have quite consistent results with a relatively small variance in performance between the folds. As the runway model, the scenario model, and the Snowtam reports give direct classifications and not probabilities, it is not possible to create ROC curves for these methods. They have a fixed TPR and FPR, and their performances are plotted as points in Figure 1. The performance of the XGBoost classification model with the threshold used in Tables 5 and 6 is plotted as a blue point on the ROC curve. We notice that all methods perform much better than random guessing, as they all lie well above the red dotted line. The prediction from XGBoost has both a higher TPR and a lower FPR than both the scenario model and the runway model. The reported runway assessments are also below the blue curve, meaning that XGBoost outperforms the reports; one only has to choose a threshold value according to the desired effect. If we want the XGBoost prediction to have the same TPR as the Snowtam reports (a sensitivity of 0.78), the XGBoost prediction has a specificity of 0.94, which is higher than the specificity of 0.89 for the Snowtam reports. The results show that the XGBoost model has indeed found patterns and relationships not covered by the knowledge- and engineering-based scenario model and runway model and outperforms them on all the metrics. The model also shows its usefulness when it not only matches the human assessment from the runway inspectors, but actually exceeds it.
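The threshold-matching exercise above (fixing the TPR at the Snowtam reports' sensitivity and reading off the specificity) can be sketched with scikit-learn's ROC utilities; the data here is synthetic stand-in data, not the landing dataset:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
# Synthetic imbalanced data standing in for the landing dataset:
# ~5% positives, with positives scoring higher on average.
y_true = (rng.random(2000) < 0.05).astype(int)
scores = np.clip(0.3 * y_true + rng.normal(0.2, 0.1, 2000), 0, 1)

auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Pick the threshold whose TPR first reaches a target sensitivity,
# e.g. matching a reference method's sensitivity of 0.78.
target_tpr = 0.78
idx = np.argmax(tpr >= target_tpr)       # first index meeting the target
threshold, achieved_fpr = thresholds[idx], fpr[idx]
```

The resulting specificity at that operating point is simply 1 − achieved_fpr, which is how the comparison against a fixed-threshold method can be made.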

Performance regression model
For the friction limited landings, the friction coefficient reflects the amount of tire-pavement friction that was available. This means that we not only know whether it was slippery, but also how slippery it was, and we can use the estimated friction coefficients as a response variable when training the XGBoost model. Predicting the friction coefficient is done using the loss function given in Eq. 13 on the friction limited landings.
The performance of the XGBoost regression model is given in Table 7 in the form of root mean squared error (RMSE), mean absolute error (MAE), and BA Error, which are defined as

RMSE = √( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² ),
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|,
BA Error = (1/n) Σ_{i=1}^{n} |BA(ŷ_i) − BA(y_i)|,

where ŷ_i is the predicted friction coefficient, y_i is the true friction coefficient, and BA(y_i) converts the friction coefficient to a braking action using Table 1. The MAE reflects the mean deviation of the predicted friction coefficient from the true friction coefficient, which is 0.0254 for the XGBoost regression model. The BA Error reflects the mean number of braking action categories the model misses by. As the runway model and the reported runway assessments give the predicted runway surface conditions only as braking actions and not as friction coefficients, RMSE and MAE cannot be obtained for these models, and we compare the models using the BA Error. The scenario model only provides a binary classification (slippery / non-slippery) and not the level of slipperiness, and is therefore not relevant in this setting. We observe that the XGBoost regression model misses on average by a little more than half a category (0.54). This is lower than both the prediction from the runway model and the reported runway assessments from the Snowtam reports, which on average miss by 0.84 and 0.71, respectively. The exact distribution of deviations from the true class is given in Figure 2. The deviation BA(ŷ_i) − BA(y_i) shows the number of categories the prediction deviated from the conditions experienced during the landing. A deviation of zero means the prediction was correct, while a positive deviation shows that the experienced conditions were worse than predicted. The deviations within ±1 are marked with blue dotted lines. The figure shows that the XGBoost regression model has both a higher number of correctly classified landings (deviation 0) than the runway model and the Snowtam reports, and a higher percentage of the predictions within ±1 deviation.
The regression model predicted 93% of the conditions within ±1, while the runway model and the runway inspectors did so 83% and 87% of the time, respectively. XGBoost thus manages to outperform the other methods also when it comes to predicting the level of slipperiness.
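The three regression metrics can be sketched as follows; the braking-action cut points are illustrative placeholders, since the paper's Table 1 mapping is not reproduced here:

```python
import numpy as np

# Hypothetical friction-coefficient cut points for the 1-5 braking
# action scale, standing in for the paper's Table 1.
BA_THRESHOLDS = [0.075, 0.11, 0.15, 0.20]

def braking_action(mu):
    """Map a friction coefficient to a 1-5 braking action category."""
    return int(np.searchsorted(BA_THRESHOLDS, mu)) + 1

def regression_errors(y_true, y_pred):
    """RMSE, MAE and BA Error as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    mae = np.mean(np.abs(y_pred - y_true))
    ba_error = np.mean([abs(braking_action(p) - braking_action(t))
                        for p, t in zip(y_pred, y_true)])
    return rmse, mae, ba_error

# Three hypothetical landings: true vs. predicted friction coefficients.
rmse, mae, ba_err = regression_errors([0.10, 0.18, 0.25], [0.12, 0.16, 0.24])
```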

Model discussions
The performance of the runway model in Figure 2 corresponds closely with the performance given in the paper by Klein-Paste et al. [32], where the runway model predicts 86% of the conditions within ±1 on a data set containing 1 261 friction limited landings in the winter seasons 2008/2009 to 2010/2011. This indicates that our alterations of the runway model described in Section 4.1 did not have much effect on the model performance. The performance of the assessments from runway inspections, however, seems to have improved over the years, as they only had 77% of the conditions within ±1 in the winter seasons 2008/2009 to 2010/2011 [32], compared to 87% over the seasons 2009/2010 to 2018/2019. The main reason for this is probably the increased focus at Norwegian airports on improving the quality of the runway assessments and runway reports over the last years, which seems to have been fruitful.
There are several reasons why it is not possible to achieve a perfect AUC of 1.0 and a BA Error of 0 for the XGBoost model, the most important being that both the data x_i and the response variable y_i are subject to bias and measurement errors. As we are working with big data and several hundred thousand landings, it is not possible to investigate every flight, weather sensor measurement, and Snowtam report for errors. But we do know that measurement errors happen, especially in the sensors of the landing aircrafts, which affects the estimated response variable. In addition, differences in pilot behavior most probably contribute to inaccuracy in whether a landing is friction limited or slippery, as some pilots might brake harder and challenge the friction under the same circumstances where others might not. Other influential factors could be the characteristics of the tires of the aircrafts, such as tire pressure, load, wear, and deformation [46]. These are factors that affect the tire-to-pavement friction and could disturb the calculation of the friction coefficient from the flight sensors. All these possible sources of imprecision make it difficult to achieve a perfect score using these data sources. The weather sensors in general provide more reliable measurements than the flight sensors. This gives us reason to believe that in some of the cases where the XGBoost models deviate strongly from the response variable, the response variable itself could be wrongly estimated, and our models might in fact give a correct prediction.
A remark regarding the prediction of the braking action is a newly agreed transition in the internationally standardized Snowtam reports. The scale of braking actions is currently transitioning from a five-point scale to a six-point scale [55,36]. Luckily, our models can easily be used with this new scale as well. Using the already trained XGBoost models, one only has to transform the predicted friction coefficients to the new braking action categories by using the new thresholds.
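Since the models predict a continuous friction coefficient, moving to the six-point scale is a pure relabeling step, as the following sketch illustrates (both sets of cut points are hypothetical, standing in for the official thresholds):

```python
import numpy as np

# Hypothetical cut points for illustration only; the real five- and
# six-point thresholds come from the Snowtam standard, not this sketch.
OLD_THRESHOLDS = [0.075, 0.11, 0.15, 0.20]           # 5 categories
NEW_THRESHOLDS = [0.05, 0.10, 0.15, 0.20, 0.25]      # 6 categories

def to_braking_action(mu, thresholds):
    """Relabel predicted friction coefficients without retraining."""
    return np.searchsorted(thresholds, mu) + 1

mu_pred = np.array([0.06, 0.12, 0.30])
old_ba = to_braking_action(mu_pred, OLD_THRESHOLDS)  # [1, 3, 5]
new_ba = to_braking_action(mu_pred, NEW_THRESHOLDS)  # [2, 3, 6]
```

The trained regression model is untouched; only the final discretization changes.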

Global explanations of the models
SHAP values are created to give local explanations, meaning they provide information about why a single prediction happened. But with the high-speed estimation of SHAP values provided by Tree SHAP, it is possible to provide local explanations for entire datasets. Plotting local explanations for a whole test set provides information about how the model works as a whole, across all the predictions. This means that the local explanations can be combined to give global explanations of the models. Figure 3 is a plot of the local SHAP values across all test samples for the classification model, which combined create a global explanation of how the classification model works for all predictions. To avoid showing ten plots (one for each test fold of the ten-fold cross validation), we showcase the training and test set of the ten-fold cross validation that gave prediction errors closest to the mean results of all ten repetitions. The figure is limited to the 20 variables with the highest sum of absolute SHAP values across the test set, Σ_i |φ_j^(i)|, which is an indication of the importance of that variable to the model. The variables are displayed in decreasing importance from the top. An increase in SHAP value (towards the right on the x-axis) contributes to a higher probability of slippery conditions, and negative SHAP values contribute to lowering the probability. Note that when the scatter points do not fit on a line, they pile up to show density, and the color of each point represents the variable value of that individual point. SHAP values are given as deviations from the expected value of the response, E[f(x)], which would be predicted if we do not condition on any variables. This means that a SHAP value of 0 indicates that including that variable does not influence the prediction at all.
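The variable ranking used in Figure 3 (sum of absolute SHAP values per variable) can be sketched as a simple aggregation over a matrix of local explanations; the SHAP matrix and variable names below are synthetic placeholders, not output from the actual models:

```python
import numpy as np

# Hypothetical matrix of local SHAP values: one row per landing in the
# test fold, one column per explanatory variable (e.g. from Tree SHAP).
rng = np.random.default_rng(7)
shap_values = rng.normal(scale=[0.5, 0.1, 0.3], size=(500, 3))
names = ["contamination_depth", "visibility", "runway_temp"]

# Global importance of variable j: the sum of |phi_j| over all instances.
importance = np.abs(shap_values).sum(axis=0)
order = np.argsort(importance)[::-1]          # most important first
top = [names[j] for j in order]
```

Sorting the columns by this aggregate and scattering each column's local values along the x-axis reproduces the structure of a SHAP summary plot.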
In Figure 3, the SHAP values are given as deviations in each instance's score obtained from all the trees of the model, i.e., the value before applying the logistic function in Eq. 12.
A first observation from Figure 3 is that the depth of contamination is important for our model, and that the deeper the contamination, the higher the probability of slippery conditions. This corresponds to the fact that a higher amount of accumulated dry snow also contributes to more slippery conditions. Other factors that increase the probability of slipperiness are a cold runway temperature, high relative humidity, and high precipitation intensity. These are all known factors that cause difficult landing conditions. One less intuitive result is that the presence of damp and wet contaminated runways makes it less slippery. These most probably become surrogate variables explaining that there is no snow or ice on the runway, which often creates more slippery conditions than just a wet or damp runway. We also see that there is a difference between the two airport runways at Oslo Airport, in that one seems to be more slippery than the other. When regarding the time of the observations, several variables with time differences up to 24 hours are important, as well as 1, 3 and 6 hours. It seems that the long-term effect of these variables affects the runway conditions, and that it is necessary to include such a wide timespan.
One interesting effect is that the presence of sand makes the model increase the probability of a landing being slippery, even though the intention of sanding is the opposite. Since the runway operators only add sand to the runways in the presence of slippery conditions, XGBoost might use the presence of sand as a surrogate variable explaining slippery conditions caused by ice or snow on the runways. In addition, even though sanding can increase tire-pavement friction, especially when applied on solid contamination, it can also make it difficult to achieve the highest levels of friction [32]. The results from the XGBoost models and the SHAP values thereby contribute to the discussion around sanding of airport runways and might indicate that sanding is not always helpful in lowering the slipperiness.
Another observation is that horizontal visibility is an important variable. It is not intuitive why this should be important for the experienced slipperiness for landing aircrafts, even though it of course affects the visual perception of the pilots. Sometimes it also indicates heavy precipitation. As Oslo Airport is located at a place where it is quite often foggy, a low horizontal visibility can be an indication of fog. Fog combined with cold weather conditions might lead to very slippery and dangerous runway conditions. This is also the reason why fog is involved in as many as two of Huseby & Rabbe's [20] eight slippery scenarios, namely Freezing fog and Stratus/fog, air temperature below 0 °C. The former scenario happens when the temperature at ground level drops to or below the freezing point and the water droplets making up the fog freeze on contact. This can result in black ice, which makes the runway very slippery.
As observed in the SHAP values for the classification model, the strengths of the along wind and across wind are among the influential variables. These variables do not directly affect the available friction between the tires and the runway, but they do affect the necessary braking force for the landing aircraft. Along wind contributes with either a stopping force or a pushing force depending on the direction, and either increases or decreases the necessary force from the brakes of the aircraft. This could affect whether a landing is friction limited or not, as the pilot might have to brake harder. Across wind also contributes to difficulty in maneuvering and to using more of the available friction on steering instead of braking. Therefore, one could say that our model works more like a landing condition predictor than a runway surface predictor, since it includes the overall experienced landing conditions for the aircrafts. This means that the model could also work for other kinds of challenging landing conditions than snow and ice, e.g. wind and rain, and thereby be expanded to non-northern airports.

Figure 4 shows the global SHAP values for the XGBoost regression model. The SHAP values are given as deviations in the friction coefficient, and higher SHAP values correspond to a higher friction coefficient, meaning less slippery conditions. The figure shows that the XGBoost regression model mostly uses the same variables as the classification model, but that the order has changed. For this model, the accumulated dry snow over the last 24 hours is the most important factor, with contamination depth as the second. Some variables have increased their contribution significantly compared to the classification model. The predicted friction coefficient lowers with a low horizontal visibility, a high dew point, and an increase in air pressure. We also see that the difference between the two airport runways is larger for the regression model than for the classification model.
One observation from the SHAP values for the regression model is that the effect of the along wind is opposite to that for the classification model. Here, a stronger positive along wind contributes to an increase in the friction coefficient, even though it should not directly affect this, as the friction coefficient is a ratio involving the frictional force between the tires and the runway. This counter-intuitive behaviour most probably comes from the calculation of the friction coefficient in the performance models mentioned in Section 2.2, where along wind has a relationship with several of the variables that affect the estimation of the friction coefficient. As these relationships and effects are complex, we are not going to discuss them further in this paper. But XGBoost seems to pick up on these relationships and use them, even though it might seem illogical when just looking at the along wind variable in isolation.

Creating autonomous models without Snowtam reports
We have shown that the XGBoost models manage to predict the experienced runway surface conditions better than the Snowtam reports created by the runway inspectors, both when classifying slippery / non-slippery conditions and when categorizing how slippery it is for the friction limited landings. Another point of interest is to check how well the models could work on their own, without any human influence, as a totally autonomous system. This means only using data from the sensor variables (the meteorological data / weather data), and not the human assessments in the Snowtam reports. Table 8 shows a comparison of the XGBoost classification model and regression model including and excluding variables from the Snowtam reports, where X_tot is the total dataset of meteorological and Snowtam data and X_met is the subset of only meteorological data. For both the classification and regression model, there is a small decrease in performance when excluding data from the Snowtam reports. But as the difference in performance is relatively small, we see that the models work quite well without information from the Snowtam reports. Even though we lose information about runway contamination and maintenance procedures, the XGBoost models seem to find other ways of describing most of this information. This was substantiated by looking at the global SHAP plots for the models trained without the Snowtam reports, where especially accumulated dry snow, wet snow, and rain had significantly increased their contribution to the models, as had the runway temperature. The models were earlier dominated by contamination depth and type, but now use accumulated precipitation and runway temperature to explain the probable type and depth of contamination on the runways.
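The X_tot vs. X_met comparison can be mimicked on synthetic data, where a stand-in "Snowtam" variable is largely redundant with the weather variables; we use scikit-learn's gradient boosting as a stand-in for XGBoost to keep the sketch self-contained, and all variables are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000
# Two "meteorological" variables drive the response; the "Snowtam"
# variable is a noisy copy of the first, i.e. largely redundant.
met = rng.normal(size=(n, 2))
snowtam = met[:, :1] + rng.normal(scale=0.3, size=(n, 1))
y = (met[:, 0] + 0.5 * met[:, 1] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

X_tot = np.hstack([met, snowtam])    # meteorological + Snowtam data
X_met = met                          # meteorological data only

auc_tot = cross_val_score(GradientBoostingClassifier(), X_tot, y,
                          scoring="roc_auc", cv=5).mean()
auc_met = cross_val_score(GradientBoostingClassifier(), X_met, y,
                          scoring="roc_auc", cv=5).mean()
# Because the Snowtam stand-in is redundant with the weather variables,
# dropping it costs little cross-validated performance.
```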

Local explanations and the decision support system
In high-risk applications such as air transportation, taking well-informed and safe decisions is of primary importance. This has increased the focus on creating data-driven decision support systems for crucial tasks within the aviation industry in recent years. Examples of this are decision support systems for tasks such as aircraft maintenance planning [47,10], aircraft dispatch assessment [35], aircraft trajectory anomaly prediction [11], runway conflict prevention [37], and aviation incident risk prediction [64]. These decision support systems can contribute to safer, more standardized, and more economic operations of several parts of the air transportation industry.
Combining the XGBoost prediction models with SHAP local explanations creates a solid framework for a decision support system for runway conditions, contributing to safer airplane landings and take-offs. In Section 4.4, we plotted all local SHAP values together to create global explanations of the models. This is a strong tool for gaining some understanding of how the models work. However, just looking at the single local explanation for a prediction can be just as useful. For a user of an artificial intelligence system, only getting the final prediction might not be helpful in itself, as it is difficult to trust a decision without any arguments. SHAP values give local explanations of every prediction, meaning we can get information about why the surface conditions were predicted as they were at all times. Figure 5 shows an example of local SHAP values for a prediction from the XGBoost regression model of the runway friction coefficient. This prediction happened at the west runway at Oslo Airport on 8 February 2018, 22:23. The predicted friction coefficient of 0.1198 is lower than the expected value. The main reason why XGBoost predicts this level of slipperiness is the presence of dry snow on ice with a depth of 8 mm, which is given both as the absence of wet runways and the presence of dry snow on ice. This splitting of importance happens because of the one-hot encoding done in Section 2.3. The air temperature is almost zero, which can create quite difficult surface conditions. The horizontal visibility is quite low, so there is either fog or heavy precipitation. We can also see that there was precipitation quite recently. One factor that makes it less slippery is that the prediction is at the west runway (Airport Runway 1) and not the east, which in general seems to provide better landing conditions.
Another factor is the low dew point, which is far below the air temperature, so any potential fog or air moisture will at least not condense and freeze on the runway.
An illustration of a decision support system created based on the output from both the XGBoost prediction models and the SHAP local explanations is shown in Figure 6. Modules 1 and 2 are the predictions from the classification model, where it is slippery if the probability of slippery conditions is higher than 50%. The probabilities are scaled to transform the expectation value of 0.0247 to 50%, to match the threshold used in Table 5. Module 4 is the output from the regression model, converted to braking action according to Table 1. Modules 5 and 6 are outputs from the local explanations, which show arguments for the prediction. The ten variables with the highest SHAP value magnitudes are given as arguments, where only up to five positive and five negative arguments are shown. The system shows the explanation of the classification model if the probability of slippery conditions is below 50% (non-slippery conditions), and the output from the explanations of the regression model if it is above 50% (slippery conditions). This way we make sure that the explanations focus on output from the model that is most trained within the given range. Module 3 is an additional feature to provide even more information and is an implementation of Huseby & Rabbe's [20] scenario model, providing more transparency by giving information about any potential slippery scenarios.
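The argument selection for modules 5 and 6 (up to five positive and five negative arguments among the ten largest-magnitude SHAP values) can be sketched as follows; the variable names and SHAP values are hypothetical:

```python
import numpy as np

def explanation_arguments(phi, names, k=5):
    """Split a local explanation into its strongest arguments.

    phi: local SHAP values for one prediction (one value per variable).
    Returns up to k variables pushing towards slippery (positive SHAP)
    and up to k pushing away from it, strongest first, mirroring
    modules 5 and 6 of the decision support system.
    """
    phi = np.asarray(phi)
    order = np.argsort(np.abs(phi))[::-1][:2 * k]   # 2k largest magnitudes
    pos = [(names[j], phi[j]) for j in order if phi[j] > 0][:k]
    neg = [(names[j], phi[j]) for j in order if phi[j] < 0][:k]
    return pos, neg

# Hypothetical local explanation for one landing.
names = ["dry_snow_on_ice", "visibility", "dew_point", "runway_west", "wind"]
phi = [0.30, 0.12, -0.05, -0.20, 0.01]
pos, neg = explanation_arguments(phi, names, k=5)
# pos: dry_snow_on_ice, visibility, wind; neg: runway_west, dew_point
```

In the actual system these tuples would be rendered as text arguments rather than raw numbers, as discussed below.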
The illustration gives the same example as Figure 5. At this time, XGBoost classifies the runway conditions as slippery with a 60% probability, and the braking action as medium. The user can see that the main reason for this level of slipperiness is dry snow on ice with a depth of 8 mm, and that the air temperature is almost zero and the horizontal visibility is low. The SHAP values are given as text, as this is easier to comprehend for the end users. The system should have two separate interfaces, one for each runway, where the variable airport runway will be used to divide the predictions in two.
Including local explanations of the predictions provides a much more useful decision support system than the prediction on its own, especially in critical systems such as risk management, where trust is a crucial issue [30]. When using the decision support system in Figure 6, the airport operators can not only trust the system's decision by seeing its decision basis, they can also check that the sensors and models work properly, and they can know what maintenance procedures to carry out to make it less slippery. The pilots can also be given information about what makes the landing conditions difficult and take this into consideration when planning a landing strategy.
There are some considerations to take into account when working with explainable artificial intelligence. One important point is multicollinearity between explanatory variables, which is a highly discussed problem in most model interpretation methods, as many of them assume independence between the variables [1,15,43,62]. Both XGBoost and interventional SHAP values are robust to multicollinearity, and including many related variables should not affect their performance [28,50,8]. One thing to bear in mind is that the interventional approach to SHAP is faithful to the prediction model, giving explanations of how the model works, not how the explanatory variables are connected to the response. One strength of XGBoost compared to other tree ensemble methods such as Random Forest is that XGBoost to a much higher degree splits only on the most important variable in a group of highly correlated variables, instead of alternating between them. This means that variables which are highly correlated with the most relevant variables might get a low importance, even though they could be highly related to the response. Since interventional SHAP values are faithful to the model, the SHAP values in Figures 3, 4, 5 and 6 must be considered to explain how the XGBoost models work, not how the explanatory variables are related to the response.

Conclusions and future work
This paper presents a machine learning framework for providing real-time decision support for the assessment of airport runway conditions. This decision support system addresses the real-world problems of efficiency and safety within the aviation industry, which follow from the expected increase in demand for air transportation.
The developed decision support system uses XGBoost to predict airport runway conditions, where the prediction models consist of a classification model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained using weather data and runway reports and predict the runway conditions represented by the friction coefficient estimated using sensor data from landing aircrafts. The performance of the XGBoost models is compared to the state-of-the-art runway model and scenario model, as well as reported runway assessments from airport inspectors.
The XGBoost models achieve a high performance and outperform all the previous methods. This shows the strong abilities of machine learning to find and use patterns to model complex, physical phenomena when domain knowledge is included through the extraction of explanatory variables. An increased accuracy in the prediction of runway assessment can aid airport operators and pilots in making more appropriate decisions, which can contribute to avoiding accidents and lead to safer airplane landings.
The prediction models are combined with SHAP approximations to create interpretable models which can provide even more useful information. Combining the SHAP values with the prediction models provides a highly accurate and trustworthy decision support system, which presents arguments for the predicted slipperiness of the runway instead of only the prediction. In addition to contributing to safer and more economic operations of airport runways, providing trustworthy information about runway conditions can also contribute to lower fuel usage and less use of chemicals. If the runway conditions are known to be good, the pilots can use less fuel on thrust reverse, and the operators can use less anti-icing and de-icing chemicals on the runway. Future work will be to expand the prediction system into a forecasting system to predict the runway conditions some hours into the future, by using time series and weather forecasts. This could help the airport operators to plan and execute necessary runway maintenance procedures, and in the logistics of airplane landings and take-offs. We are also going to work with methods involving censored data from survival analysis, to take advantage of the measurements of minimum available friction from the aircraft landings that are not friction limited. To extend the system to other airports than Oslo Airport, it could be of interest to work with transfer learning methods, instead of training completely separate models for all airports.

Figure 6: An illustration of a decision support system for airport runway conditions using the output from our models. Modules 1 and 2 are the output from the classification model, module 3 is the output from the scenario model, module 4 is the output from the regression model, and modules 5 and 6 are the output from the local explanations.