Predicting Football Team Performance with Explainable AI: Leveraging SHAP to Identify Key Team-Level Performance Metrics

Abstract: Understanding the performance indicators that contribute to the final score of a football match is crucial for directing the training process towards specific goals. This paper presents a pipeline for identifying key team-level performance variables in football using explainable ML techniques. The input data includes various team-specific features such as ball possession and pass behaviors, with the target output being the average scoring performance of each team over a season. The pipeline includes data preprocessing, sequential forward feature selection, model training, prediction, and explainability using SHapley Additive exPlanations (SHAP). Results show that 14 variables have the greatest contribution to the outcome of a match, with 12 having a positive effect and 2 having a negative effect. The study also identified the importance of certain performance indicators, such as shots, chances, passing, and ball possession, to the final score. This pipeline provides valuable insights for coaches and sports analysts to understand which aspects of a team’s performance need improvement and enable targeted interventions to improve performance. The use of explainable ML techniques allows for a deeper understanding of the factors contributing to the predicted average team score performance.


Introduction
Artificial intelligence (AI) and specifically machine learning (ML) are quickly becoming popular methods for predicting the average scoring performance of European football teams [1]. This is because the technical data collected during football matches can provide valuable insights into a team's playing style and tactics [2]. By analyzing this data, coaches and analysts can gain a deeper understanding of a team's strengths and weaknesses, and use this information to make more informed decisions about player recruitment and opposition analysis [3].
One of the key challenges in analyzing this data is that it comes in a variety of forms, including match sheets, ball events, and tracking data [4,5]. These data types differ in their granularity and availability, but data collection companies are increasingly annotating more types of events and providing information about each event [6]. To effectively analyze team behavior, it is necessary to summarize its playing style in a way that is both humanly interpretable and suitable for data analysis [7]. This typically involves constructing a "fingerprint" of the team's behavior, capturing characteristics such as the types of actions they tend to perform and the types of gameplay patterns the team's players participate in.
Moreover, one of the major challenges with using AI in sports performance analysis is the lack of transparency and interpretability of the results. Traditional AI models, such as neural networks, can be difficult to understand and interpret, making it hard to explain how and why a particular decision or prediction was made. This is where explainable AI (XAI) comes in. Explainable AI is a subfield of AI that focuses on creating models that can provide clear and interpretable explanations for their predictions and decisions [27]. One of the most popular methods for explainable AI is SHapley Additive exPlanations (SHAP), which is a unified method for interpreting the predictions of any machine learning model [28]. It is based on the concept of Shapley values from cooperative game theory, which provides a way to fairly distribute a value among a group of individuals, such as the players on a football team. Recently, two studies [29,30] have employed SHAP as a post-hoc explainability tool to evaluate the impact of each feature on the final outcome, specifically in the context of match-specific score prediction. Both studies primarily concentrated on individual match predictions and were limited to data from a single league, utilizing a moderately sized feature set.
The objective of this paper is to use XAI to identify the key team-level performance metrics that are most important in predicting a football team's performance. By concentrating on overall performance and identifying critical parameters influencing scoring performance (average goal difference over the season), this paper seeks to offer a broader understanding of the factors that determine success in football. The use of an expansive dataset from multiple European leagues, combined with an extensive set of features, sets this study apart from previous research and enhances its potential to uncover novel insights and patterns in football performance. This approach allows for a more transparent and interpretable understanding of how a machine learning model is making its predictions, which can be particularly useful in high-stakes decision-making scenarios such as predicting a team's performance in a football match. By leveraging explainable AI, coaches and analysts can gain a deeper understanding of team performance and make more informed decisions about player recruitment and opposition analysis.
The structure of the paper is as follows: In Section 2, the proposed methodology and characteristics of the dataset are presented. The results of the study, including the performance of the proposed ML model and the explanations generated at both the global and local levels, are discussed in Section 3. The implications and limitations of the study are discussed in Section 4, and conclusions are presented in Section 5.

Materials and Methods
In this article, we aim to predict the average goal difference of football teams over an entire season using machine learning and subsequently explain the predictions made by the model. Our input data consists of various team-specific features such as ball possession and pass behaviors. The target output is the average scoring performance (goal difference) of each football team over the season. The proposed pipeline (Figure 1) includes three main steps: (1) data preprocessing, (2) model training and prediction, and (3) explainability. We first preprocess and clean the data to ensure that it is suitable for training and testing our model. Next, we use XGBoost, a powerful and widely used machine learning algorithm, to train the model and make predictions. Finally, to ensure the explainability of the predictions, we use SHAP to provide an explanation for each prediction at a team-specific or overall level. This allows us to understand the factors that drive the prediction and the contribution of each feature to the predicted average team score performance.

Dataset
Our dataset includes all matches played during the regular season of the top division in 11 European countries for the 2021-2022 season (Table 1). For each match, data were recorded for both teams, resulting in a total of 5992 observations. However, data were unavailable or incomplete for eight matches, as recorded by InStat Scout (https://football.instatscout.com/ (accessed on 20 June 2022)).
Specifically, this study involved collecting 160 variables, either directly through InStat Scout or indirectly calculated by the authors using data from this platform. These variables were recorded in a Microsoft Excel spreadsheet (a full description of the variables is given in Appendix A). Prior research has shown that the indicators obtained through InStat Scout have high reliability, with K values ranging from 0.90 to 0.98, as per studies by Casal et al. (2019), Castellano and Echeazarra (2019) and Gómez et al. (2018) [31][32][33].

Data Pre-Processing
To ensure the quality and consistency of our dataset, we performed several preprocessing steps before conducting the analysis: (1) Data cleaning: We first cleaned the raw data by removing any missing values or erroneous records to ensure the accuracy and reliability of the dataset. (2) Averaging variables: All variables, including the output variable, were averaged over the course of the season to provide a general representation of the team's performance throughout the year. For instance, an average team score performance of +2.1 indicates that, on average, the team scored 2.1 more goals than it conceded. (3) Feature scaling: Features were standardized using StandardScaler [34] so that all of them are on the same scale. This is important for many ML algorithms, as it can help prevent one feature from dominating the others during the training process.
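The three preprocessing steps can be illustrated with a minimal sketch. It assumes pandas and scikit-learn are available; the toy match records, the column names and the team labels are hypothetical and are not taken from the study's dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical match-level records: one row per team per match.
matches = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C"],
    "goals_for": [3, 2, 1, 0, 2, 2],
    "goals_against": [0, 1, 1, 2, 3, 1],
    "passes": [520, 480, 390, 410, 450, 470],
    "ball_possession_percent": [61.0, 58.0, 47.0, 45.0, None, 52.0],
})

# (1) Data cleaning: drop records with missing values.
clean = matches.dropna().copy()

# Per-match target: goals scored minus goals conceded.
clean["goal_difference"] = clean["goals_for"] - clean["goals_against"]

# (2) Averaging: one row per team holding season averages of features and target.
season = clean.groupby("team").mean(numeric_only=True)
y = season["goal_difference"]
X = season.drop(columns=["goals_for", "goals_against", "goal_difference"])

# (3) Feature scaling: zero mean and unit variance for every feature.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(X), index=X.index, columns=X.columns
)
```

In this toy example, team A's averaged target is +2.0, meaning it scored on average two more goals per match than it conceded.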

Machine Learning
Prior to model training, we applied a feature selection technique, sequential forward selection, to identify the most important features for the task at hand [35]. This is a wrapper-based feature selection method, for which we used XGBoost as the model and R-squared as the selection criterion. The algorithm iteratively adds features to the model one by one and evaluates their impact on its performance. This helps us identify the features that contribute the most to the model's accuracy and reduces the risk of overfitting.
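A minimal sketch of sequential forward selection with R-squared as the criterion is given below. It relies on scikit-learn's SequentialFeatureSelector and, to stay within scikit-learn, swaps in GradientBoostingRegressor as a stand-in for XGBoost. The synthetic data, the number of features to select and the five-fold inner validation are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data (assumption): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Gradient-boosted trees as a stand-in for the XGBoost regressor.
model = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0)

# Forward selection: start from no features and greedily add the feature
# that most improves the cross-validated R-squared.
selector = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", scoring="r2", cv=5
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support()).tolist()
print("Selected feature indices:", selected)
```

Because the method is wrapper-based, every candidate feature requires refitting the model, so the cost grows with both the number of candidate features and the number of features to select.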
Once the relevant features had been selected, we trained and tested an XGBoost regressor on the data using a ten-fold cross-validation strategy with internal hyperparameter tuning in the training phase [36]. XGBoost is a powerful gradient-boosting algorithm that has been shown to perform well on a wide range of tasks. We then used SHAP to understand the contribution of each feature to the global and local predictions. The performance of the proposed model was compared to that of three other well-known regression algorithms: Support Vector Regression (SVR) [37], Random Forest (RF) [38], and the k-Nearest Neighbor Regressor (kNN) [39].
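The evaluation protocol can be sketched as nested cross-validation: an outer ten-fold loop estimates performance, while an inner grid search tunes hyperparameters using only the training part of each fold. The sketch below is an assumption-laden illustration rather than the study's configuration: GradientBoostingRegressor stands in for XGBoost, and the data, parameter grid and baseline settings are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the team-level feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=120)

# Outer loop: ten-fold cross-validation for performance estimation.
outer = KFold(n_splits=10, shuffle=True, random_state=0)

# Inner loop: hyperparameters are tuned on each outer training fold only.
boosted = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 3]},
    scoring="r2",
    cv=3,
)

candidates = {
    "GBR": boosted,  # stand-in for the XGBoost regressor
    "SVR": SVR(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "kNN": KNeighborsRegressor(n_neighbors=5),
}

# Mean R-squared over the ten outer folds for each algorithm.
results = {
    name: cross_val_score(est, X, y, cv=outer, scoring="r2").mean()
    for name, est in candidates.items()
}
for name, score in results.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

Keeping the tuning inside the outer folds prevents the reported score from being influenced by test data.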

Explainability
In order to understand and explain the predictions made by our machine learning model, we used the SHAP library [28,40]. SHAP provides a unified measure of feature importance that can be applied to both linear and non-linear models, offering a consistent approach to understanding the impact of features on model predictions. SHAP values are derived from cooperative game theory and provide an interpretable allocation of each feature's contribution to a prediction, while ensuring that the sum of all feature attributions equals the difference between the predicted outcome and the average baseline prediction. This approach allows for a fair distribution of each feature's influence on the prediction, accounting for potential interactions and dependencies among features. In our study, we employ SHAP as a post-hoc explainability tool to quantify the effect of each feature on the final outcome, helping us identify the key parameters that contribute to a team's overall scoring performance. For a given prediction, SHAP attributes a contribution value to each feature, with positive values indicating that the feature pushed the prediction higher and negative values indicating that the feature pushed the prediction lower. This allows us to understand how each feature contributed to the final prediction and how the features compare to one another. Overall, SHAP values provide a detailed and interpretable explanation of the inner workings of our regression model.
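The additivity property described above can be checked on a small example by computing Shapley values exactly. The sketch below uses only the Python standard library and does not use the SHAP library; the three-feature model, the background rows and the explained instance are hypothetical. Features outside a coalition are filled in from the background rows, which is one common way to define the value function and is an assumption here.

```python
from itertools import combinations
from math import factorial

def model(row):
    # Hypothetical fitted model with an interaction between shots and passes.
    shots, passes, lost_balls = row
    return 0.8 * shots + 0.3 * passes - 0.5 * lost_balls + 0.2 * shots * passes

# Hypothetical background data (standardized features) and instance to explain.
background = [(-1.0, 0.5, 1.0), (0.0, -0.5, 0.0), (1.0, 0.0, -1.0)]
x = (1.5, 1.0, -0.5)

def value(subset):
    """Average prediction with features in `subset` fixed to x, others from background."""
    total = 0.0
    for z in background:
        mixed = tuple(x[j] if j in subset else z[j] for j in range(len(x)))
        total += model(mixed)
    return total / len(background)

def shapley_values():
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight of a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                members = set(subset)
                contribution += weight * (value(members | {i}) - value(members))
        phi.append(contribution)
    return phi

phi = shapley_values()
baseline = value(set())   # average prediction over the background rows
prediction = model(x)

# Attributions sum to the gap between the prediction and the baseline.
print(sum(phi), prediction - baseline)
```

Exact enumeration scales exponentially with the number of features, which is why practical tools rely on model-specific or sampling-based approximations.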
In addition to the SHAP values employed in our study, there are other explainable AI methods, such as Local Interpretable Model-agnostic Explanations (LIME) [41], that can be used to provide insights into the importance of features in complex models. LIME is a popular technique that explains individual predictions by fitting a locally interpretable model around the specific data point. While both SHAP and LIME aim to increase the interpretability of machine learning models, they differ in their approach. SHAP values are grounded in cooperative game theory and provide a unified measure of feature importance that is both locally and globally accurate. In comparison, LIME focuses on local interpretability and may not provide the same level of global accuracy. Additionally, SHAP values satisfy a consistency property: if a model is changed so that a feature's marginal contribution increases or stays the same, the attribution assigned to that feature does not decrease. LIME does not guarantee this property. For our study, we chose to use SHAP values because they provide a more consistent and accurate measure of feature importance. However, future work could explore the use of LIME or other XAI methods to analyze football team performance and compare the resulting insights with our findings.
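To make the contrast concrete, the following sketch illustrates the idea behind LIME: perturb the instance, weight the perturbations by proximity, and fit a weighted linear surrogate. It is written in the spirit of LIME using NumPy only and is not the lime library, which additionally handles feature discretization and sparse feature selection. The black-box function, sampling scale and kernel width are hypothetical choices.

```python
import numpy as np

def black_box(X):
    # Hypothetical non-linear model of two standardized features.
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

def local_surrogate(x0, n_samples=2000, scale=0.3, kernel_width=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Sample perturbed points around the instance being explained.
    Z = x0 + rng.normal(scale=scale, size=(n_samples, x0.size))
    # Exponential kernel: nearby perturbations receive larger weights.
    dist = np.linalg.norm(Z - x0, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept, centred on x0.
    A = np.hstack([np.ones((n_samples, 1)), Z - x0])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], black_box(Z) * sw, rcond=None)
    return coef  # [intercept, local slope of feature 0, local slope of feature 1]

x0 = np.array([0.0, 1.0])
coef = local_surrogate(x0)
print("Local slopes:", coef[1:])
```

The fitted slopes describe the model only in the neighbourhood of `x0`; a different instance would generally yield different coefficients, which is the sense in which LIME is local.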

Results
This section presents the results of the proposed explainable machine learning pipeline, including the explanations generated by the SHAP algorithm, which provides insight into the factors that influence the model's predictions.
Based on the sequential forward selection method, we identified 141 out of the 159 initial variables as the most relevant features for predicting football team performance. The selected variables are listed in Appendix A, and the importance of the top 15 variables is visualized in Figure 2. By using these 141 features, our model was able to achieve a satisfactory performance in terms of both accuracy and interpretability. Table 2 and Figure 3 present the results of the model's performance in predicting the average team score over a year. The scatter plot in Figure 3 compares the actual values (x-axis) with the predicted values (y-axis). Each point in the scatter plot represents a team, with the x-coordinate denoting the actual averaged team score performance and the y-coordinate denoting the predicted averaged team score performance. The line of best fit is a visual representation of how closely the predictions align with the actual results, with a slope of 1 indicating a perfect fit. The distribution of the points around the line of best fit demonstrates the accuracy and balance of the predictions.
The next step in analyzing the model's performance is to examine its explainability. This analysis aims to understand the factors that influence the model's predictions and how they relate to the actual outcome (the team's average score performance). By understanding the underlying relationships and patterns, we can gain insight into the behavior of the model and identify areas for improvement (modifiable key team-level performance metrics). This can also provide valuable information for stakeholders (e.g., coaches, sport analysts) to understand the decision-making process of the model and the rationale behind its predictions.
Global explanations: Variables such as shots per possession percentage, missed chances, entries into the penalty box, conversion percentage of chances, and passes have a positive impact on the team's predicted score performance. Conversely, variables such as lost balls in the team's own half and the ratio of dribbles per minute of possession have a negative effect on the score, indicating that an increase in these variables leads to a decrease in the team's score.
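A global ranking of this kind is commonly obtained by aggregating per-team SHAP values. The sketch below assumes a matrix of SHAP values is already available; the numbers and feature names are hypothetical. Importance is taken as the mean absolute SHAP value, and the direction of the effect is summarized by the sign of the correlation between feature values and SHAP values, which is a simplification of what a SHAP summary plot conveys.

```python
import numpy as np

# Hypothetical feature names, standardized feature values (teams x features)
# and matching SHAP values; none of these come from the study's data.
feature_names = ["shots_per_possession", "passes", "lost_balls_own_half"]
X = np.array([
    [1.2, 0.8, -1.0],
    [0.1, -0.3, 0.2],
    [-0.9, 0.4, 1.1],
    [-0.4, -0.9, -0.3],
])
shap_values = np.array([
    [0.60, 0.16, 0.30],
    [0.05, -0.06, -0.06],
    [-0.45, 0.08, -0.33],
    [-0.20, -0.18, 0.09],
])

# Global importance: mean absolute contribution across teams.
importance = np.abs(shap_values).mean(axis=0)

# Direction: +1 if higher feature values push predictions up, -1 if down.
direction = [
    int(np.sign(np.corrcoef(X[:, j], shap_values[:, j])[0, 1]))
    for j in range(X.shape[1])
]

ranking = [feature_names[j] for j in np.argsort(-importance)]
for j in np.argsort(-importance):
    print(feature_names[j], round(float(importance[j]), 3), direction[j])
```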
Local explanations (team-specific): Figures 4-7 show SHAP force plots that illustrate how the different variables contributed to the model's prediction f(x) for specific teams. The higher the score, the more the model is likely to predict a positive outcome (good score performance), and the lower the score, the more the model is likely to predict a negative outcome (bad score performance). The variables that were important to making the prediction for a given team are shown in red and blue, with red representing features that pushed the score higher, and blue representing features that pushed the score lower. The features that had more of an impact on the score are placed higher, and the size of that impact is represented by the size of the bar.
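The way a force plot is read can be mimicked with a few lines of plain Python. The base value and the per-feature contributions below are hypothetical numbers for a single team, chosen only to show how contributions are ordered by magnitude and split into those that raise and those that lower the prediction.

```python
# Hypothetical base value (average model output) and SHAP contributions for one team.
base_value = 0.05
contributions = {
    "high_pressing_percent": -0.12,
    "accurate_passes": -0.08,
    "corners": 0.03,
    "key_passes": 0.10,
}

# The prediction is the base value plus the sum of all contributions.
prediction = base_value + sum(contributions.values())

# Order features by the size of their impact, largest first.
ordered = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

pushes_up = [name for name, value in ordered if value > 0]    # drawn in red
pushes_down = [name for name, value in ordered if value < 0]  # drawn in blue

print(f"f(x) = {prediction:.2f}")
print("Raise the prediction:", pushes_up)
print("Lower the prediction:", pushes_down)
```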
In the case of Liverpool FC, all the variables pushed the score higher (as indicated by the red bars), indicating that they are important for the model's prediction of a positive outcome. Similar findings were obtained for Manchester City FC, where the team performed well in all key team-level performance variables. On the other hand, using SHAP force plots, it is possible to identify which variables have a negative effect on the team's performance. For example, four key variables (shots per quantity of possession percent, chances percent of conversion, accurate passes, and high pressing percent) were identified as negatively impacting the scoring performance (average goal difference) of West Ham FC. Similarly, lost balls in their own half, offsides, and corners were identified as key performance variables for Lazio FC that have a negative effect on the scoring performance and would require improvement. In summary, SHAP force plots allow stakeholders such as coaches or sports analysts to see which aspects of a specific team's game performance are satisfactory and which need improvement, enabling targeted interventions and adjustments to be made to improve the team's performance.

Discussion
Recognizing the performance indicators that contribute to the final score of a match is important in order to direct the training process toward specific goals. Consequently, the purpose of the current study was to identify and measure the contribution of each performance indicator to the final score of a match. We managed: (i) to predict the goal difference between teams in a match and (ii) to identify the contribution of each performance indicator to the match score both for the teams as a whole and for each team individually. The results showed that for the teams as a whole, fourteen variables had the greatest contribution to the outcome of the match. Of these, twelve (shots per quantity of possession percent, missed chances, entrance to the penalty box, chances percent of conversion, key passes accurate, passes, key passes, accurate passes, ratio passes per lost balls, high pressing percent, positional attacks with shots, sum duration of ball possession) had a positive effect, while two (lost balls in own half, ratio dribbles per minute of ball possession) had a negative effect. When we looked at each team separately, the variables that contribute the most to shaping the scores in their matches differ.
Shots per quantity of possession percent is the variable with the biggest contribution. In addition, among the fourteen most important performance indicators is the variable positional attacks with shots. Both of the above variables show that the ability of teams to make shots has a significant positive contribution to the final score in their favor. This finding is in agreement with other research that showed that the total number of shots made by a team is an important factor in determining the match outcome [42][43][44], but also with the research of Castellano et al. (2012) which showed that successful teams make more shots [45].
However, besides the shots, there are also other variables that contribute positively to the final score of the match. Firstly, our research showed that the creation of chances, even if they are lost (chances missed), but also the ability to convert the chances into goals (chances percent of conversion) had a significant positive contribution. Although chances are the factor that determines the variable xG [46], only one study conducted on beach soccer [47] has examined their effect on match score and found that chance creation is a factor that can distinguish winners from defeated teams. Secondly, four variables related to passing and ball possession (passes, accurate passes, ratio passes per lost balls, sum duration with ball possession) are among the fourteen most important. This finding confirms almost all previous research that has examined the contribution of ball possession to match outcome [14][42][43][44][48][49][50]. On the contrary, the research of Harrop and Nevill (2014) showed that only successful passes help distinguish the games that a team wins [51], while total passes showed the opposite. However, it should be pointed out that this research was carried out with data that concerned only one team. Thirdly, entrance to the penalty box is another variable that we found to significantly contribute to a positive score in a match and this is in agreement with research that had similar objectives [52]. Finally, key passes and high pressing percent (high pressing success rate) have not been examined by relevant research for their contribution to the match outcome. However, other studies showed the importance of key passes to the playing effectiveness of a team [53][54][55], but also the usefulness of a successful high pressing because defending near the opponent's goal seems to be associated with success in soccer [56][57][58].
On the other hand, among the fourteen variables that affect the outcome of the match, there are two variables (ratio dribbles per min of possession, lost balls in own half) that have a negative contribution. Liu et al. (2015) and Harrop and Nevill (2014) had already shown that dribbles had clearly negative effects on the probability of winning [43,51], which agrees with our own finding. The variable lost balls in own half has not been considered in research investigating the contribution of performance indicators to the match outcome. However, both among coaches and in the scientific literature, it is commonly accepted that the closer to the rival goal the start of the offensive action, the greater the probability of success in ball possessions [59][60][61].
In addition to applying our methods to all teams as a whole, we also applied them to some teams separately. In these cases, there were differences in the fourteen variables that had a greater contribution to shaping the outcome of their matches depending on the philosophy of their coach and the tactical principles they adopted. For example, Liverpool manager Jurgen Klopp's preference for the "high press" is well known [62][63][64]. This is reflected in the results of our research, since three of the fourteen variables (high pressing percent, ball recoveries in opponent's half, ratio defensive challenges attacking 3rd plus defensive challenges midfield 3rd per defensive challenges) for Liverpool are related to this particular philosophy, while for the teams as a whole only one of them appeared.
On the other hand, the style of Guardiola's teams (tiki-taka) is characterized by high percentages of possession with many and short passes [65][66][67]. The results of our research showed that in Pep's team (Manchester City) five of the fourteen variables that have the greatest contribution to the final score of the matches are related to this style of play (passes, accurate passes, sum duration with ball possession, ratio passes per lost balls, ball possession percent). We looked at one more English Premier League team (West Ham). When they attempted to press their opponent high, they usually did so in a 4-2-4 formation and the players in the front four line often had long distances from the remaining six players. This made them vulnerable in the given situations. This particular observation was made after a qualitative analysis of West Ham's games by one of our authors, who is a certified soccer performance analyst. The results of our quantitative research confirm this particular observation, since although in the teams as a whole the "high pressing percentage" variable contributes positively to the result, in West Ham it contributes negatively.
In addition to the three English teams, we also looked at an Italian team (Lazio) whose manager (Maurizio Sarri) has given his name to a style of play called Sarribal [68]. Sarribal is characterized by persistence in building up from the back even if the opponent presses high with many players. That is, the team uses many short passes in its defensive half with the aim of drawing the opponent high. When this is done, the players are instructed to make vertical forward passes behind the opposing defensive line. The results of our study fully reflect this specific style of play. In particular, (a) many short passes increase the number of passes, accurate passes and ball possession percentage, (b) the persistence in the build-up and the large number of passes in the team's half can increase the opponent's recoveries closer to the team's goal (lost balls in own half), (c) the vertical passes are often key passes that increase the number of final actions (shots per quantity of possession percentage), while (d) the movements attempted by players behind the opposing defensive line (to receive vertical passes) can also increase offsides.
In this paper, we presented a pipeline for predicting the average team score performance of football teams using machine learning, data preprocessing, and explainability. However, there are certain limitations to the study that should be acknowledged. First, the data used in this study is limited to one season, the 2021-2022 season, which may not fully capture the dynamics and complexities of team performance over time. Additionally, our prediction targets the average team score performance over the year; the pipeline does not predict score performance for individual matches, where predictive performance would be expected to be poor. This limits the scope of the study and the potential applications of the proposed pipeline. To improve the model's performance and to provide more robust predictions, it would be beneficial to gather data from multiple seasons and to work on predicting score performance in individual matches.
In addition to the proposed sequential forward selection technique, there are other robust feature selection techniques, such as BORUTA, which is a wrapper-based method built around the random forest algorithm [69]. BORUTA iteratively compares the importance of features to that of shadow features, which are shuffled copies of the original features, to determine their relevance. While BORUTA is considered more robust and can handle non-linear relationships better than sequential forward selection, it may be computationally more expensive. We acknowledge that comparing different feature selection methods could provide further insights into the best approach for our specific task. Future work could investigate the performance of BORUTA and other feature selection techniques in the context of predicting football team performance.
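The shadow-feature idea can be sketched with a single comparison round. This is a simplification: the full BORUTA procedure repeats the comparison over many iterations and applies a statistical test before accepting or rejecting a feature. The sketch assumes scikit-learn and uses synthetic data in which only the first two columns are informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): columns 0 and 1 carry signal, 2 and 3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# Shadow features: each original column shuffled independently,
# which breaks any relationship with the target.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

# Fit a random forest on the original and shadow features together.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(np.hstack([X, shadows]), y)

importances = forest.feature_importances_
n_original = X.shape[1]

# A feature is retained in this round if it beats the best shadow feature.
threshold = importances[n_original:].max()
keep = [j for j in range(n_original) if importances[j] > threshold]
print("Features beating the best shadow:", keep)
```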
Finally, while our model's primary objective is to understand the importance of various team-level performance metrics within the current season, we acknowledge that the pipeline does not predict future performance. The input and output are simultaneous in time, which means that the model cannot be used as a predictor for subsequent seasons. Future work could explore the possibility of incorporating lagged variables or historical data to enable predictions for upcoming seasons. However, our current approach still provides valuable insights into the factors that contribute to a team's performance, helping stakeholders make informed decisions based on these insights.

Conclusions
This paper aimed to identify and measure the contribution of various performance indicators to the final score of a football match. Through the use of explainable machine learning techniques, we were able to identify the contribution of each team-level performance indicator to the match score for all teams as a whole and for each team individually. The results provided valuable insights into which performance indicators had the greatest impact on the outcome of a match. This information can be used by coaches and sports analysts to make targeted interventions and adjustments to improve the performance of teams. It is important to note that the results of this study are based on data from one season and are not able to predict individual match scores, which are limitations that should be considered when interpreting the findings. Despite this, the study provides a useful framework for understanding the key factors that contribute to a team's performance and can be applied to future research using data from multiple seasons.

Table A1. Full list of variables used in our analysis.

Sum_long_passes
Passes with a length of at least 40 m, regardless of the area from which they were made

Pass_long_def_3rd
Passes made in the defensive third that were at least 40 m long

Pass_long_mid_3rd
Passes made in the midfield third that were at least 40 m long

Pass_long_att_3rd
Passes made in the attacking third that were at least 40 m long

RATIO_long_passes_PER_passes
Passes with a length of at least 40 m/total number of passes

Defensive_challenges
Duels involving the players of the defending team

Def_challenges_def_3rd
Duels involving the players of the defending team and taking place in the defensive third of that team

Def_challenges_mid_3rd
Duels involving the players of the defending team and taking place in the midfield third of that team

Def_challenges_att_3rd
Duels involving the players of the defending team and taking place in the attacking third of that team

Air_challenges
Duels in which the ball is above shoulder height and players try to play with their heads

Air_challenges_won
Successful air challenges

Air_challenges_missed
Unsuccessful air challenges

Air_challenges_won__percent
Air challenges won/air challenges (%)

Air_challenges_def_3rd
Air challenges in the team's defensive third

Air_challenges_mid_3rd
Air challenges in the team's midfield third

Air_challenges_att_3rd
Air challenges in the team's attacking third

Challenges
Total number of duels

Duels involving the players of the defending team and taking place in the defensive third of that team/total duels involving the players of the defending team

RATIO_def_challenges_mid_3rd_PER_defensive_challenges
Duels involving the players of the defending team and taking place in the midfield third of that team/total duels involving the players of the defending team

RATIO_def_challenges_att_3rd_PER_defensive_challenges
Duels involving the players of the defending team and taking place in the attacking third of that team/total duels involving the players of the defending team

RATIO_def_challenges_att_3rd__def_chall_mid_3rd_PER_defensive_c
Duels involving the players of the defending team and taking place in the midfield and attacking third of that team/total duels involving the players of the defending team

DIFFERENCE_air_challenges_att_3rd_MINUS_air_challenges_def_3rd
Air challenges in the team's attacking third minus air challenges in the team's defensive third

RATIO_air_challenges_att_3rd___air_challenges_def_3rd_PER_air_c
Duels involving the players of the defending team and taking place in the defensive and attacking third of that team/total duels involving the players of the defending team

Chances
A goal-scoring opportunity

Missed_chances
A goal-scoring opportunity which did not result in a goal

Fouls
An action that is not compatible with the rules of the game and is used to stop the progress of the opponent's attack

Yellow_cards
An illegal action punishable by a yellow card from the referee

Average passes per minute of possession

AVERAGE_passes_PER_ball_possession
Average passes per possession

Ball_possessions__quantity
The number of ball possessions

Average_duration_of_ball_possession_sec
The average duration of each ball possession

Sum_duration_with_ball_possession
The total duration of possession for a team

Ball_possession__percent
The percentage of ball possession for a team

Opponent_s_ball_possession_percent
The