Empirical Comparisons for Combining Balancing and Feature Selection Strategies for Characterizing Football Players Using FIFA Video Game System

Modelling individual player performance with machine learning is a mature task in sports analytics. Among the most significant challenges in machine learning are the class imbalance and high dimensionality problems. We conducted a comprehensive literature review and observed that both issues have mostly been studied independently: feature selection addresses the dimensionality problem by determining a subset of relevant features, while data sampling seeks to make the data more balanced by adding or removing instances. We also found that efforts have been made to study the effect of the joint use of feature selection and balancing techniques; however, prioritizing between feature selection and sampling is still difficult, and the relationship between them remains unclear. This paper presents a large-scale comparison of characterizing football players into nine positions using FIFA video game data, whereas most previous studies in this field have characterized players into only three classes according to their positions. The proposed methodology consists of three main steps. In the first step, a sampling technique is applied to deal with class imbalance, while the second step encompasses a feature selection technique, which deals with the high dimensionality problem. The third step combines feature selection and data sampling to deal with both issues. We made the comparisons based on nine feature selection algorithms and three balancing techniques, and we evaluated their performance using the random forest classifier.
We found that 1) feature selection techniques alone did not improve the accuracy of the baseline model, 2) balancing techniques improved the accuracy compared to the baseline, and 3) the proposed methodology, the joint application of resampling and feature selection with data balanced by the random oversampling (ROS) method or the synthetic minority oversampling technique (SMOTE), was superior both to the use of a single technique and to the original imbalanced training set. Overall, the proposed methodology improved prediction accuracy compared to the baseline model. Moreover, it provided a significant decrease in the number of features, from 29 to 10 features on average.


I. INTRODUCTION
Football is regarded as the most popular sport in the world in terms of both spectators and players [1]. Its popularity has increased in the last few years, making it an essential contributor to the global economy [2]. In fact, the revenue of European football clubs alone for 2017 was estimated at $27 bn [3]. In football, creating an optimal lineup of players capable of beating an opposing lineup is a significant challenge, since player positions require different skills [4]. Furthermore, there is no formula or scientific equation to identify a player's preferred position in the team. Coaches generally make the assignment based on their experience and observations of the players [5], which exposes player selection to many biases.
For the past two decades, machine learning has become an essential methodology for transforming football statistics into useful information that helps teams and guides coaches in analyzing opponents and making better decisions in real time using sensor-generated data. These data range from camera footage to all types of physical measurements and human monitoring. However, research in the field of football analytics with machine learning techniques is limited. The main reason for this is the lack of a large-scale dataset for players, since collecting such rich information about players can be costly, making sensed data limited to teams with high purchasing power [5]. In football analytics, video games such as FIFA and Football Manager (FM) are considered alternative sources of data, and since 2014, researchers and clubs have used them as such. Shin and Robert used FIFA video game data to predict match results and found that these data can be used in machine learning projects to make accurate predictions [6].
The website of SoFIFA classifies simulated players in the FIFA video game series into 14 positions and provides player ratings on 29 different skills, where each skill is evaluated on a 0-to-100 scale. Previous studies, on the other hand, combined these 14 positions into only three classes according to playing roles: defense, midfield, and forward line. Combining these positions stems from two main reasons. First, most players can play in multiple positions, and second, the multi-class distribution of players in the SoFIFA data is highly skewed, degrading classification accuracy through class imbalance and high dimensionality in the data. The skew across classes leads to the class imbalance problem, which may decrease prediction performance and extend the training period. Class imbalance is considered one of the most significant issues in data mining. An imbalance problem occurs when one class has many more samples than another. In such a situation, most classifiers are biased towards the majority classes and hence show poor classification rates on the minority classes. Although minority samples occur rarely, they are crucial in some areas, such as detecting fraud in banking operations, finding network intrusions, and diagnosing cancer [7].
Feature selection is a process of choosing a subset of relevant features so that the quality of prediction models can be maintained or improved. Moreover, feature selection is one of the essential data preprocessing steps in data mining. Data sampling seeks to make the data more balanced by adding or removing instances [8].
During the literature review, we identified that many data mining techniques are helpful on their own but not sufficient; some papers suggest that applying two or more techniques together may give better solutions for the class imbalance problem [9][10][11][12]. This paper presents a large-scale comparison for characterizing football players into nine positions using imbalanced data, namely FIFA video game data. Thus, this paper also explores the role of data mining techniques in improving the performance of machine learning algorithms by treating both the high dimensionality and class imbalance problems in the dataset used. We therefore propose an approach combining feature selection and data sampling to solve the problems present in the above-mentioned data. The proposed methodology comprises three main steps. The first step applies a sampling technique to deal with class imbalance, while the second step applies a feature selection technique, which focuses on the high dimensionality problem. The third step combines feature selection and data sampling to resolve both issues. We made the comparisons based on nine feature selection algorithms and three balancing techniques, and their performances were evaluated using the random forest (RF) classifier. Sections 2, 3, 4, 5, 6, 7, and 8 provide a literature review, the steps of the methodology, the FIFA dataset used in the study, the elements of the study, the evaluation metrics, the research design, and the analysis of results, respectively. The last section encompasses the discussion and conclusions.

Literature Review
In this section, we briefly review the literature on characterizing football players' positions and the studies examining the influence of combining resampling and feature selection techniques to manage class imbalance.

Characterizing Football Players' Positions
Characterizing football players for particular positions according to their skills and specific metrics has attracted the attention of coaches and data scientists. Therefore, most European clubs have enlisted data scientists or algorithm specialists to help them with this requirement. However, grouping and selecting players based on their individual skills and data using machine learning methods is an open field that has not seen much published research [13,14]. To the best of our knowledge, ours is the first approach that characterizes players into nine positions in a match according to their skills using a supervised learning approach.
On the other hand, most of the studies in this field focused on characterizing players into three categories according to their positions: defense, midfield, and forward line. We strongly believe that building a model according to nine positions using an extended set could reveal more insights about the characteristics of football players required in each position. Table 1 summarizes the most critical recent research about the machine learning-based characterization of football players' positions; the comparison has been made in terms of classifier, data type, highest accuracy, and number of instances, features, and positions classified.

Study of the Influence of Combined Resampling and Feature Selection Technique
As mentioned in the Introduction section, the imbalance problem occurs when one class has many more samples than another. In such a situation, most classifiers are biased towards the majority classes and give poor classification rates for the minority classes. The methods for classifying imbalanced datasets fall into three main categories: the algorithmic approach, the data-preprocessing approach (resampling), and the feature selection approach. Each of these techniques has pros and cons [7,19,20]. Moreover, through the literature review, we observed that studies have examined the effects of combining resampling and feature selection to address class imbalance. For combining the two techniques, researchers have investigated four different approaches: Approach 1: feature selection and modelling based on original data; Approach 2: feature selection based on original data and modelling based on sampled data; Approach 3: feature selection based on sampled data and modelling based on original data; and Approach 4: feature selection and modelling based on sampled data. Supervised feature selection methods are strongly affected by the distribution of the data (the imbalance problem), so it is natural to select the features on the sampled data and then model on the sampled data as well. We therefore present a comparative study using only the fourth approach, and our results also confirm the above-mentioned viewpoint. In other words, we regard comparative studies of Approaches 1, 2, and 3 as redundant. We believe that the experiments conducted in our study will guide future practice in the categorization of class-imbalanced data. Table 2 summarizes the most important recent research on the influence of combining resampling and feature selection to tackle class imbalance; the comparison has been made in terms of classifier, data type, approaches, and type of feature selection and data sampling algorithm.

Steps of Methodology
Before detailing the proposed methodology, we reviewed the studies examining the influence of combining resampling and feature selection to solve class imbalance, shed light on the ideal modelling approach, and discussed the techniques used. The proposed methodology consists of three main steps. The first step applies a sampling technique to deal with class imbalance, while the second step applies a feature selection technique to address the high dimensionality problem. The third step combines feature selection and data sampling to deal with both issues. In addition to the main steps, two more steps were added to deal with other challenges in the dataset.

The findings and limitations of the studies summarized in Table 2 are as follows:
- (C4.5) Feature selection before sampling is mostly better (A1 is mostly better than A2), and undersampling performs better than oversampling when the dataset is highly imbalanced. However, only two feature selection methods (filter and wrapper) and two sampling methods (RUS and ROS) were considered; after sampling the data in approach A2, the study did not examine whether modelling should be based on the original or the sampled data; and the classifier type was not considered, as only one classifier was used in all experiments.
- [12] Software defect prediction: the A1 approach performs, on average, better than the A3 and A2 approaches when RUS is employed, whereas the A3 approach performs better than the other two when an oversampling method (ROS or SMOTE) is used. Only one feature selection method (filter) was considered, alongside three sampling methods: RUS, ROS, and SMOTE.
- [11] Tweet sentiment analysis: overall, there is little difference between the three approaches, although A3 performs better than A1 for the 5:95 dataset (highly imbalanced scenarios). Only one feature selection method (filter) and one sampling method (RUS) were considered.
- [24] Webspam detection (WEBSPAM-UK2006 and WEBSPAM-UK200; feature selection after data sampling; C4.5): the study focuses on constructing an ensemble decision tree classifier through data balancing and selecting several optimal feature subsets for each subclassifier. It considers only one scenario for data balancing (feature selection after data sampling, with a single feature selection method and RUS for sampling).
- [10] Heart failure prediction: balancing the imbalanced classes yields a good enhancement in performance for all measurements. After sampling the data in approach A2, the study did not examine whether modelling should be based on the original or the sampled data.

1- Normalize and expand data:
The original dataset consisted of 17,981 players; however, some players had multiple positions. Duplicating these players therefore expanded the dataset to 27,251 instances. The FIFA dataset contains a column indicating each player's preferred position, and nine out of the 14 play positions were assigned. We then normalized all the dataset features except the preferred-position attribute to ensure consistency, so that the value of each feature ranges from 0 to 1. Moreover, some feature values carry a '+' or '-' modifier; we evaluated these modifiers numerically instead of keeping the values as strings.
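As an illustrative sketch of this preprocessing step (the `'62+2'`-style raw format and the column names are assumptions for the example, not taken verbatim from the dataset), the modifier evaluation and min-max normalization could look like:

```python
import pandas as pd

def parse_skill(value):
    """Evaluate a raw skill entry. Some entries carry a modifier such
    as '65+2' or '70-1'; reduce them to a single number (the exact raw
    format is assumed here for illustration)."""
    s = str(value)
    if '+' in s:
        base, mod = s.split('+')
        return float(base) + float(mod)
    if '-' in s and not s.startswith('-'):
        base, mod = s.split('-')
        return float(base) - float(mod)
    return float(s)

def min_max_normalize(df, skill_cols):
    """Scale each skill column to the [0, 1] range."""
    out = df.copy()
    for col in skill_cols:
        out[col] = out[col].map(parse_skill)
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out

players = pd.DataFrame({
    'Finishing': ['62+2', '88', '45-1'],          # hypothetical raw values
    'Preferred Position': ['CM', 'ST', 'CB'],     # left untouched
})
clean = min_max_normalize(players, ['Finishing'])
```

The preferred-position column is deliberately excluded from normalization, matching the step described above.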

2- Data splitting and model evaluation:
After cleaning the dataset, 80% of the data was randomly allocated to train the classifier, while the remaining 20% was used for testing. In classification problems, the simplest way to evaluate an algorithm's performance is to use separate training and testing sets: the original data is divided into two parts, the model is trained on the first part and makes predictions on the second, and the predictions are evaluated against the expected results. The split size is often chosen based on the dataset size; training sets commonly comprise between 70 and 80% of the data, and test sets between 20 and 30% [25].
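An 80/20 hold-out split of this kind can be produced with scikit-learn; the data below is a synthetic stand-in for the 29 FIFA skill features and nine position labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 29))       # toy stand-in for 29 skill features
y = rng.integers(0, 9, size=1000)     # toy stand-in for 9 position labels

# 80/20 split; stratify keeps each class's proportion in both parts,
# which matters for the imbalanced position labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
```

Stratification is an implementation choice here; the paper only specifies a random 80/20 split.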

Dataset Description
The difficulty of obtaining large-scale reliable data and the related cost problems were explained in the Introduction section. For these reasons, this study uses the FIFA soccer video game data, which is commonly used in the literature. It has been used successfully to predict the results of football matches [6,26], and it has been shown to be comparable to or better than other sources of football data [6].
The EA Sports FIFA video game series system began in 2009. It offers detailed information, including weekly updates, about a broad set of European soccer players and their skills, covering three aspects: physical, mental, and technical skills. This information is available on the official website of the game (http://sofifa.com/). The FIFA video game series has produced a huge amount of fine-grained data, which has proven particularly useful for coaches, sports analysts, and football fans worldwide [27,28]. In this study, we used the dataset for the FIFA18 game available on Kaggle (https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset). It contains 17,980 cases, where each case relates to one football player. Each football player has more than 70 attributes, which can be divided into personal attributes (e.g., age, nationality, and value), performance attributes (e.g., overall, potential, and stamina), and position (classified into 14 positions). For our analysis, we selected 29 continuous variables (player performance indicators on a scale of 0-100) and one categorical variable (player's position).

Sample Analysis
In this paper, we seek to characterize football positions based on individual player skills, covering three aspects: physical, mental, and technical skills [16,29]. Table 3 summarizes these skills.
There are 11 different positions in a soccer team in general: goalkeeper, centre back, full back, wing back, centre midfielder, central attacking midfielder, central defensive midfielder, midfielder, winger, centre forward, and striker. These positions represent both the player's primary role and their operation area on the pitch [14]. The location of these positions on a football pitch is shown in Figure 1.
Previous studies showed that each position has different criteria. For instance, while the criterion 'ball control' is significant for central attacking midfielders, it is less critical for central defensive midfielders (see, e.g., [30]). Therefore, we also seek to identify the essential features required in the process of characterizing players.
2) The goalkeeper is a special position that differs from the others in characteristics such as 'overhead exit' and 'person-to-person battles', so we disregard it as a separate position.
3) The original dataset consisted of 17,981 players, but since some players had multiple positions, duplicating them expanded the dataset to 27,251 instances.
4) To avoid class overlapping, we considered nine primary positions out of 14, since previous studies [30,31] indicated that the skills required for some positions are the same (for example, right and left full back, and right and left midfielder). Table 4 summarizes the nine primary positions. Class overlapping is a critical problem in which data samples appear as valid instances of more than one class. Researchers have found that misclassification often occurs near class boundaries, where overlapping usually occurs as well; the class overlapping problem may therefore be responsible for noise in datasets [32,33].
5) There are 29 relevant features for predicting players' positions (see Table 3).
6) The data are imbalanced. For example, among 27,251 players across nine positions, only 350 players were centre forwards, accounting for only 1.28% of the samples. Figure 2 shows the class imbalance ratios of the data, and the observations in Figure 2 are summarized in Table 5. Formula (1) gives the imbalance ratio of each class [34].
IR(c) = n_max / n_c  (1)

where n_c is the number of samples in class c and n_max is the number of samples in the majority class.
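The per-class imbalance ratio (majority-class count divided by the class's own count) can be computed directly from the label counts; the position labels below are a toy example:

```python
from collections import Counter

def imbalance_ratios(labels):
    """Per-class imbalance ratio: majority-class count / class count.
    The majority class gets ratio 1.0; rarer classes get larger ratios."""
    counts = Counter(labels)
    n_max = max(counts.values())
    return {c: n_max / n for c, n in counts.items()}

# Toy label distribution standing in for the skewed position labels
labels = ['CB'] * 600 + ['CM'] * 300 + ['CF'] * 100
ratios = imbalance_ratios(labels)
```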

Resampling Techniques
A dataset is said to be imbalanced if it contains more samples from one class than from the rest. Resampling techniques are among the most commonly used means of dealing with imbalanced datasets; they involve removing examples from the majority class (undersampling) or duplicating examples from the minority class (oversampling), as shown in Figure 3. In this paper, we therefore present a comparative study of the influence of combining these three resampling methods with feature selection methods for tackling class imbalance. For this study, we selected the following resampling methods, which are among the most widely reported in the literature. Additionally, these methods have not been tested together with feature selection methods, except in the study presented by [12], in which only one feature selection method was used (see Table 2).

Random Undersampling (RUS)
The RUS deletes examples from the majority class and can discard information valuable to the model.

Random Oversampling (ROS)
The ROS duplicates examples from the minority class in the training dataset and can result in overfitting for some models [35].

Synthetic Minority Oversampling Technique (SMOTE)
In the SMOTE method, each minority class sample is taken and synthetic samples are created by looking at some or all of its k nearest neighbours; the minority class thereby becomes oversampled. The main difference from other sampling methods is that synthetic samples are produced by interpolating towards nearest neighbours instead of copying and replicating the minority class samples. The main disadvantage of the SMOTE method is the noise it can generate: noisy samples are often intricately intertwined with the other class, confuse the model, and are hard to classify [36].
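To make the interpolation idea concrete, the following is a minimal pure-NumPy sketch of SMOTE's core mechanism (a production system would typically use the imbalanced-learn implementation instead):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority-class samples by interpolating
    between a random minority sample and one of its k nearest
    neighbours (the core idea of SMOTE, simplified for illustration)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # pairwise squared Euclidean distances within the minority class
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours of each sample
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))             # pick a minority sample
        j = nn[i, rng.integers(k)]               # pick one of its neighbours
        gap = rng.random()                       # random point on the segment
        synth[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synth

X_minority = np.random.default_rng(1).normal(size=(20, 29))
X_synth = smote_sketch(X_minority, n_new=50)
```

Because each synthetic point lies on a segment between two real minority samples, every coordinate stays within the minority class's per-feature range.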

Feature Selection
Feature selection is one of the main preprocessing steps in many machine learning applications. It is a process of selecting a subset of relevant features, reducing data dimensionality for use in model construction (so that prediction performance will be improved or maintained), and speeding up the learning process. Many features may be irrelevant or contain no useful information. Thus, their inclusion may negatively impact classification performance. Therefore, feature selection also helps data miners acquire a better understanding of their data by telling them about the necessary features and their correlation with each other [8,37]. In contrast to other dimensionality reduction techniques, such as those based on projection (e.g., principal component analysis), feature selection techniques do not alter the variables' original representation. Thus, they preserve the original semantics of the variables, thereby offering the advantage of interpretability by a domain expert [38]. In this way, they can find out the required player performance attributes for each position, and a coach would have an objective criterion to select the players.
Feature selection techniques can be broadly categorized into three categories, depending on how they combine the feature selection search with the construction of a classification model: filter, wrapper, and embedded. The following subsections provide a brief explanation of each technique and the most prominent advantages, disadvantages, and algorithms used in this study.

Filter Methods
Filter feature selection methods use statistical techniques to obtain a specific score and assign it to each feature. By looking only at the intrinsic properties of the data, filter methods can assess the relevance of features [39]. The selection of a subset of features is made as a preprocessing step: after each feature's score is calculated, the low-scoring features are removed, and the remaining ones are used as predictors in the model construction [38,40].
Thanks to their simplicity, filter feature selection methods are widely used in sports predictions [41]. Examples of such methods and their usage in sports prediction include information gain, chi-squared, ANOVA [42], mutual information (MI) [43], correlation-based feature selection (CFS), the INTERACT algorithm, ReliefF, and minimum redundancy maximum relevance (mRMR) [44].
In this study, we used CFS, chi-squared, MI, and mRMR as filter feature selection methods. The following subsections provide a brief explanation of each algorithm.

Correlation-based feature selection algorithm (CFS)
This method uses a correlation-based heuristic evaluation function to determine the merit of a particular feature subset for predicting the class label. In other words, the CFS evaluates feature subsets based on the heuristic that 'Good feature subsets contain features highly correlated with (predictive of) the classification, yet uncorrelated with (not predictive of) each other'.
The heuristic uses the Pearson's correlation coefficient and can be calculated using the following formula:

Merit_S = k * r_cf / sqrt(k + k(k - 1) * r_ff)  (2)

where Merit_S is the merit of the current subset S of k features, r_cf is the mean of the correlations between each feature and the class variable, and r_ff is the mean of the pairwise correlations between every two features [12]. Correlation coefficients whose magnitude is between 0.7 and 0.9 indicate variables that can be considered highly correlated, and coefficients whose magnitude is between 0.5 and 0.7 indicate variables that can be considered moderately correlated [45].
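The merit heuristic can be evaluated directly; the correlation values below are illustrative, showing that a predictive but internally uncorrelated subset outscores an equally predictive but redundant one:

```python
import numpy as np

def cfs_merit(r_cf, r_ff, k):
    """CFS merit of a k-feature subset, where r_cf is the mean
    feature-class correlation and r_ff the mean feature-feature
    correlation: merit = k*r_cf / sqrt(k + k(k-1)*r_ff)."""
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Same predictiveness (r_cf), different internal redundancy (r_ff)
good = cfs_merit(r_cf=0.6, r_ff=0.1, k=5)
redundant = cfs_merit(r_cf=0.6, r_ff=0.8, k=5)
```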

Chi-squared (CS)
The chi-squared feature evaluation indicates the significance of each of the original features; based on this, the user can keep the most-significant features and discard the least-significant ones. In chi-squared feature selection, a feature's significance is measured by the chi-squared test statistic between the feature and the target class. Equation (3) is used to calculate the chi-squared statistic, where 'observed' is the actual number of class observations and 'expected' is the number of class observations that would be expected if there were no relationship between the feature and the class. The sum runs over each value of the feature, since the chi-squared method requires that numeric features be discretized before the statistic is calculated [46].

chi² = Σ (observed − expected)² / expected  (3)

A high chi-squared test score indicates that the feature and the target class are unlikely to be independent and that, therefore, we should keep the feature in our new dataset.
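A chi-squared filter of this kind is available in scikit-learn as `SelectKBest` with the `chi2` score function; the sketch below uses synthetic non-negative features as a stand-in for the 0-100 skill ratings:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# chi2 requires non-negative features; FIFA skill ratings (0-100) qualify
X = rng.integers(0, 101, size=(500, 29)).astype(float)
y = (X[:, 0] > 50).astype(int)          # class depends only on feature 0

# Keep the 10 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X, y)
```

The strongly dependent feature 0 receives a high score and survives the selection, while independent features score low.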

Mutual information (MI)
The MI is another statistical method used in feature selection. It measures how mutually dependent two variables (X, Y) are, i.e., the amount of information obtained about one random variable through the other. Equation (4) is used to calculate the MI between two discrete random variables X and Y:

I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )  (4)

where p(x, y) is the joint probability function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively. For continuous random variables, the double summation is replaced by a double integral:

I(X; Y) = ∫∫ p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy
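scikit-learn estimates this quantity for classification targets via `mutual_info_classif`; a small synthetic example:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 2] > 0).astype(int)           # only feature 2 carries information

# One MI estimate per feature; the informative feature scores highest
mi_scores = mutual_info_classif(X, y, random_state=0)
```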

Minimum redundancy maximum relevance (mRMR) technique
The mRMR is a feature selection approach that tends to select features with a high correlation with the class (output) and a low correlation among themselves. For continuous features, the F-statistics can be used to calculate correlation with the class (relevance), and the Pearson's correlation coefficient can be used to calculate the correlation among the features (redundancy). Thereafter, features are selected one by one by applying a greedy search to maximize the objective function, which is a function of relevance and redundancy. Two commonly used types of the objective functions are mutual information difference (MID) criterion and mutual information quotient (MIQ) criterion, which represent the difference or the quotient of relevance and redundancy [47].
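The greedy mRMR selection with a MID-style criterion can be sketched as follows; the F-statistic is rescaled to [0, 1] here so that relevance and redundancy are on comparable scales (an implementation choice for this illustration, not part of the original formulation):

```python
import numpy as np
from sklearn.feature_selection import f_classif

def mrmr_mid(X, y, n_select):
    """Greedy mRMR sketch with a MID-style criterion: at each step pick
    the feature maximizing relevance minus mean redundancy. Relevance is
    the F-statistic (rescaled to [0, 1]); redundancy is the mean
    |Pearson correlation| with the features already selected."""
    relevance, _ = f_classif(X, y)
    relevance = relevance / relevance.max()
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]     # start with the most relevant
    while len(selected) < n_select:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=300)   # near-duplicate of 0
y = (X[:, 0] + X[:, 4] > 0).astype(int)
picked = mrmr_mid(X, y, n_select=3)
```

The redundancy penalty delays the near-duplicate feature, so the independent informative feature is picked before it.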

Wrapper Methods
The wrapper feature selection methods generate several feature subsets evaluated according to their predictive power when used with a specific classifier [39]. As described by Saeys et al., a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The evaluation of a specific subset of features is obtained by training and testing a specific classification model. A search algorithm is then 'wrapped' around the classification model to search the space of feature subsets. The application of wrapper methods to high-dimensional datasets requires special attention, since the space of feature subsets grows exponentially with the number of features and the search becomes computationally intractable. Heuristic search methods are used to guide the search for an optimal subset of features to tackle this problem [38,40]. The two most common greedy search techniques used to perform wrapper-style feature selection are sequential feature selection and recursive feature elimination (RFE). Sequential feature selection algorithms can be either forward, as in sequential forward selection (SFS), or backward, as in sequential backward elimination (SBE). In this study, we used SFS, SBE, and RFE as wrapper feature selection methods. The following subsections provide a brief explanation of each algorithm.

Sequential Forward Selection (SFS)
The SFS starts from the empty set. It performs best when only a small number of features are involved. Nonetheless, the main disadvantage of SFS is that it cannot remove features that become insignificant after the addition of other features.

Sequential Backward Elimination (SBE)
The SBE works in the opposite way to that of the SFS. The SBE starts with a full set of features. It works best with many features in the dataset [48].
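Both directions are available in scikit-learn through `SequentialFeatureSelector`; a minimal forward-selection sketch on synthetic data (`direction='backward'` would give SBE instead):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

# Start from the empty set and greedily add the feature that most
# improves cross-validated accuracy, until 3 features are selected
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction='forward')
sfs.fit(X, y)
X_selected = sfs.transform(X)
```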

Recursive Feature Elimination (RFE)
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the RFE aims to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through a model-specific attribute (such as the coefficients or feature importances). Then, the least important features are pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features is eventually reached [49].
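With scikit-learn, the procedure reads as follows; logistic-regression coefficients serve as the importance weights in this sketch:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 5] > 0).astype(int)

# Drop the weakest feature (smallest |coefficient|) one at a time
# until 4 features remain
rfe = RFE(LogisticRegression(), n_features_to_select=4, step=1)
rfe.fit(X, y)
```

`rfe.support_` marks the surviving features and `rfe.ranking_` records the elimination order (rank 1 = kept).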

Embedded Methods
Like the wrappers, the embedded methods depend on a specific learning algorithm. However, while the search and evaluation procedures are separated in the wrappers, an embedded method performs feature selection during classifier construction using its internal parameters. Embedded methods are therefore faster than wrappers and make more efficient use of the available data, since they do not need to split it into a training set and a test set [50].
Tree-based models such as the RF, extra trees, and XGBoost are popular approaches for embedded methods. Other embedded methods are the least absolute shrinkage and selection operator (LASSO), with the L1 penalty, and ridge, with the L2 penalty, for constructing a linear model. These methods shrink many feature coefficients to zero or near zero [51]. In this study, we used the RF and LASSO as embedded feature selection methods. The following subsections provide a brief explanation of each algorithm.

Embedded-random forest (RF)
The feature evaluation approach based on the RF is known as an embedded method [52]. It provides a variable importance criterion for each feature by computing the mean decrease in classification accuracy on the out-of-bag (OOB) data from bootstrap sampling [53]. Assuming bootstrap samples b = 1, …, B, the mean decrease in classification accuracy d_j for variable x_j, used as the importance measure, is given by

d_j = (1/B) Σ_{b=1}^{B} (Acc_b − Acc_b^{π(j)})

where Acc_b denotes the classification accuracy of the model built on bootstrap sample b for its OOB data, and Acc_b^{π(j)} is the corresponding accuracy after the values of variable x_j have been randomly permuted in the OOB data (j = 1, …, N). Finally, a z-score representing the variable importance criterion can be computed as z_j = d_j / (σ_j / √B), once σ_j, the standard deviation of the accuracy decreases, has been calculated.
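In practice, scikit-learn exposes impurity-based importances on a fitted forest, plus a related permutation-importance routine (computed on a supplied dataset rather than strictly on the OOB samples, so it approximates rather than reproduces the formula above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 2] > 0).astype(int)           # only feature 2 matters

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# Impurity-based importances come for free with the fitted forest;
# permutation importance follows the accuracy-decrease idea described above
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
```

Both criteria single out the informative feature.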

Embedded-least absolute shrinkage and selection operator (LASSO)
The LASSO is a powerful method that performs both regularization (L1) and feature selection on the given data. It penalizes the beta coefficients of a model by limiting the sum of their absolute values, which has to be less than a specific fixed value. This shrinks some of the coefficients exactly to zero, indicating that the corresponding predictors are multiplied by zero when estimating the target; the variables that retain a non-zero coefficient after shrinking are selected to be part of the model. Equivalently, the LASSO adds a penalty term, with a tuned lambda value, to the cost function [51], minimizing

Σ_{i=1}^{n} (y_i − Σ_j x_ij β_j)² + λ Σ_j |β_j|

When lambda (λ) is 0, the penalty term vanishes and no parameters are eliminated. An increase in λ increases bias, while a decrease in λ increases variance. This is how the LASSO reduces overfitting and helps in feature selection.
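For classification, the same idea is commonly applied through an L1-penalized logistic regression combined with `SelectFromModel`; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 4] > 0).astype(int)

# In scikit-learn, C is the inverse of lambda: a smaller C applies a
# stronger L1 penalty and shrinks more coefficients exactly to zero
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
mask = selector.get_support()           # True for features with non-zero coef
```

The informative features keep non-zero coefficients and are retained.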

Classification
The final step of our proposed methodology involves a supervised learning predictive model. The classification stage aims to characterize football players into nine positions. In this study, we used only one classifier in the empirical comparisons because we seek to increase classification accuracy through balancing techniques and feature selection regardless of the classifier. The RF was selected for this task owing to its frequent use in the literature on characterizing players [17,18] and in data mining domains. Moreover, it is a relatively fast state-of-the-art algorithm [12,54].

Random Forest (RF)
The RF is an ensemble classification approach that has demonstrated high accuracy and robustness. It consists of several uncorrelated decision trees. For a classification task, the RF builds a set of decision trees from randomly selected subsets of the training data and then collects the votes from the different trees to decide the final class of the test instance. The general architecture of the RF is shown in Figure 4.
The RF was first introduced in 1999 by Leo Breiman. In his studies, Breiman explored various methods of randomizing decision trees (sampling), for example, using bagging or boosting [55]. In bootstrapping, the classifier creates new datasets by sampling from the original data and then averages the errors across these sets to estimate variance (unlike hold-out splitting, in which the data are divided into two parts for training and testing). The hallmarks of the RF mainly include [56]: 1) bootstrap sampling (bagging): randomly selecting a number of samples with replacement.
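A minimal sketch of this vote-aggregating behaviour with scikit-learn's RandomForestClassifier, using synthetic data with nine classes standing in for the nine positions (default parameters, as adopted throughout the study; the data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Nine classes stand in for the nine player positions; features are illustrative.
X, y = make_classification(n_samples=900, n_features=20, n_informative=10,
                           n_classes=9, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is grown on a bootstrap sample; the forest aggregates their votes.
rf = RandomForestClassifier(random_state=0)  # default parameters, as in the study
rf.fit(X_tr, y_tr)
y_pred = rf.predict(X_te)
```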

Evaluation Metrics
In machine learning, several metrics are used to evaluate the performance of classification models. Generally, statistical methods such as hold-out (train-and-test split), cross-validation, and bootstrap can be used with predictive models to estimate model performance on the training set [57]. The confusion matrix, classification report, and accuracy are considered the most critical metrics for evaluating classification models on the testing data [25].

Hold-out (train and test split)
In classification problems, the simplest way to evaluate an algorithm's performance is to use different training and testing sets. In this technique, the original dataset is split into two parts: the first part trains the algorithm, which then makes predictions on the second part, and the predictions are evaluated against the expected results. Generally, the split size depends on the size of the dataset; it is common to use 70-90% of the data for training and 10-30% for testing [25]. In this study, we used the train-and-test split: the samples (see Table 5) were randomly divided into 70% for training and 30% for testing.
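The 70/30 hold-out split described above can be sketched with scikit-learn (the toy arrays are illustrative; the stratify argument is an optional extra that preserves the class ratios in both parts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 toy features
y = np.arange(50) % 2               # toy binary labels

# 70/30 hold-out split, as used in this study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```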

Confusion Matrix
A confusion matrix is a practical presentation of the accuracy of a model with two or more classes. The matrix displays predictions on the x-axis and actual outcomes on the y-axis; the matrix cells contain the number of predictions made by the algorithm [25], as shown in Figure 5. As in previous studies [58], the minority class was considered positive, while the majority class was considered negative. Therefore, according to Tables 4 and 5, the centre forward position was regarded as positive (minority class) while the midfielder was considered negative (majority class), which means TP: the player is a centre forward and is classified as a centre forward; FP: the player is a midfielder and is classified as a centre forward; TN: the player is a midfielder and is classified as a midfielder; FN: the player is a centre forward and is classified as a midfielder.
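A small sketch of extracting TP, FP, TN, and FN with scikit-learn's confusion_matrix, using the same convention (1 = centre forward as the positive minority class, 0 = midfielder as the negative majority class; the label vectors are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# 1 = centre forward (positive/minority), 0 = midfielder (negative/majority).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

# With labels=[0, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
```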

Classification Report
The classification report provides a convenient representation when working on classification problems, giving a quick overview of a model's accuracy through several measures derived from the model's confusion matrix. It displays the precision, recall, F1-score, and support (the number of actual occurrences of each class in the specified dataset). These metrics give a more profound intuition of classifier behaviour than overall accuracy, which can mask functional weaknesses in one class of a binary or multi-class problem. In binary classification, the precision, recall, and F1-score are defined as shown in formulas (8), (9), and (10), respectively [59]. In multi-class classification, the same measures can be computed per class by defining one class as positive and the remaining classes as negative.
Precision = TP / (TP + FP) (8)
Recall = TP / (TP + FN) (9)
F1-score = 2 × Precision × Recall / (Precision + Recall) (10)
Since we are dealing with an imbalanced class problem, recall is an important metric to consider. From the football point of view, having high values of FN is not good: midfielders are usually good at defence and offence, unlike centre forwards, who are not required to be good at defence. This means that having too many FP is not as severe as having too many FN.
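These measures can be computed with scikit-learn, either through classification_report or directly from formulas (8)-(10) (the toy label vectors are illustrative):

```python
from sklearn.metrics import (classification_report, precision_score,
                             recall_score, f1_score)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

# Per-class precision, recall, F1-score, and support in one table.
report = classification_report(y_true, y_pred)

# The same quantities computed directly.
p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2PR / (P + R)
```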

Classification accuracy
A typical metric for measuring the performance of learning systems is the classification accuracy rate: the number of correct predictions divided by the total number of predictions. Classification accuracy is considered the most popular evaluation metric for classification problems in machine learning [25]. However, empirical evidence shows that this measure is biased with respect to data imbalance and the proportions of correct and incorrect classifications [60]. These shortcomings have motivated the search for measures such as precision, recall, and F1-score. Classification accuracy is defined in formula (11):
Accuracy = (TP + TN) / (TP + TN + FP + FN) (11)

Research Design
In this study, we aim to create a machine learning classifier to characterize football players' positions. Moreover, we seek to address the imbalance problem in the dataset.

Research Questions
The research questions for this study are as follows: Research Question 1: Can machine learning algorithms make recommendations to improve team performance? Research Question 2: Can data mining techniques improve the performance of machine learning algorithms?
To answer the first question, we discuss the implementation of the baseline algorithm in Section (8.1) and the evaluation of its performance on our data. Thus, we explore one of the primary aspects of sports analytics in football, using a supervised predictive model to characterize players according to nine positions. For the second question, in Sections (8.2) and (8.3), we discuss the importance of applying two preprocessing techniques, resampling and feature selection, to jointly reduce the complexity of the training datasets and solve the class imbalance problem. For resampling, we used three different algorithms: RUS, ROS, and SMOTE. For feature selection, nine methods were used to evaluate the effectiveness of the various techniques and build a machine learning classifier.

Design of comparative experiments
A Python module called scikit-learn was used to build the machine learning models and execute the feature selection algorithms. Moreover, a Python toolbox called the imbalanced-learn API was used to tackle the curse of imbalanced datasets. All models were created using the default parameters unless otherwise noted. The design of the comparative experiments was based on the 3×9 crossings of resampling and feature selection methods using the RF classifier, which produced four different combinations: 1) Baseline (one model). 2) Resampling versus baseline (three models). 3) Feature selection versus baseline (nine models). 4) Joint application of resampling and feature selection (27 models). Figure 6 shows the results of these four combinations in terms of the confusion matrix for the dataset, which was used to evaluate the algorithm in Research Question 1 and to develop the predictive models for Research Question 2 (40 models in total). In the next section, the results of these matrices are interpreted and presented in terms of precision, recall, and F1-score.

Analysis of results
The results in Figure 6 were analyzed in the following three ways, organized from a low to a high level of detail. (Each comparative analysis involves all the possible cases obtained from the combination of a classifier [see 5.3], a data partition [see 6.1], and performance measures [see 6.2, 6.3, and 6.4].)

Baseline Model
A baseline is a simple procedure for making predictions on a specific predictive problem. The skill of this model provides the bedrock of the lowest acceptable performance of a machine learning model on the original dataset, against which all other models can be evaluated. If a model achieves performance at or below the baseline, it means that something is wrong or the model is not appropriate for the problem at hand. The RF was used to establish the baseline model in our experiments. The classification report results in terms of accuracy, precision, and recall are summarized in Figure 7.
We acknowledge that tuning the algorithm's parameters can lead to better results, but we adopted the classification model's default parameters in all the experiments in order to maintain the baseline performance as the basis for comparison. The focus of this study is not to examine the pros and cons of the classification models used; rather, it is to investigate the joint influence of resampling and feature selection for tackling class imbalance.

Resampling versus Baseline (A1)
In this sublevel, the resampling techniques used in the study were analyzed. In the first column of Figure 6, the classification results obtained from the resampled training sets were compared with those provided by the corresponding original training sets in terms of the confusion matrix. Figure 8 shows the resampled sets after applying the three resampling techniques in terms of accuracy, precision, and recall. Owing to the random behaviour of RUS, ROS, and SMOTE, the resampled sets were randomly divided into 70% for training and 30% for testing in each experiment involving these techniques. The results obtained from these experiments are summarized as follows: the use of resampling techniques improved accuracy, precision, and recall compared with the baseline, and ROS had a relative advantage over the other balancing methods.

Feature selection versus Baseline (A2)
In this sublevel, the feature selection techniques used in the study were analyzed. Table 6 shows the new subsets whose dimensionality was reduced by the nine feature selection techniques, where nine subsets were produced. The implementations of the feature selection techniques are those included in the scikit-learn library with their default parameters, except for some parameters set in advance. In CFS, the most critical parameter is the correlation between each feature and the class variable; we set the threshold to 0.5 in our experiments, which produced a subset of 10 features. For the chi-squared algorithm, we set k = 10 for all the datasets to select the 10 features with the highest chi-squared statistics. For the MI algorithm, the 10 features with the highest MI score were selected. For the mRMR algorithm, the MIQ criterion was used as the objective function to select the 10 features with the highest correlation with the class. For the SFS, SBE, and RFE algorithms, subsets of 10 features were generated and evaluated by the RF classifier. For the LASSO algorithm, we set alpha to 0.1 in our experiments, and a subset of 11 features was produced.
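As an example of the filter-method settings described above, the following sketch applies the chi-squared and MI criteria with scikit-learn's SelectKBest and k = 10 (the synthetic 29-feature data mirror the dataset's dimensionality but are otherwise illustrative; chi-squared requires non-negative inputs, hence the absolute value):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=29, n_informative=8,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# k = 10, as set for the filter methods in the experiments.
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X, y)
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
```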
Analysing the results, in the first row in Figure 6, the classification results obtained from the training and test sets whose dimensionality was reduced by feature selection techniques are compared with those provided by the corresponding original training sets (baseline) in terms of the confusion matrix. The classification report results in terms of accuracy, precision, and recall are summarized in Figure 9, for sets whose dimensionality was reduced. The results obtained from this experiment can be summarized as follows: The use of feature selection techniques alone tends to deteriorate results for all models compared with the baseline model in terms of accuracy, precision, and recall; thus, the experiments demonstrate that the evaluated feature selection techniques did not improve the accuracy of the classifier.

Level B (a middle level of analysis)
In this level, the classification results obtained from the joint application of resampling and feature selection were compared with baseline results in terms of the confusion matrix (see the second, third, and fourth rows in Figure 6). Experiments in this level included the following steps (which represent the main methodology proposed for this study): 1) Applying the sampling technique to deal with class imbalance.
2) Using the feature selection technique to deal with the high dimensionality problem.
3) Building the models based on the sampled data and the new subsets selected by the feature selection techniques. It is worth noting that applying feature selection to the balanced data produced subsets that differed slightly from those resulting from applying feature selection to the unbalanced data (Tables 7, 8, and 9).
The results of these experiments in terms of accuracy, precision, and recall are summarized in Figures 10 and 11. The experimental comparisons with the baseline model were made on the basis of the average accuracy, precision, and recall of the nine feature selection methods over the three balancing methods (RUS, ROS, and SMOTE). The following inferences can be made from Figures 10 and 11: -The use of feature selection with data balanced by the RUS method leads to deteriorated results for all models compared with the baseline in terms of accuracy, precision, and recall.
-The use of feature selection with data balanced by the ROS and SMOTE methods leads to improved accuracy, precision, and recall compared with the baseline model. -No single filter, wrapper, or embedded feature selection method is the best; therefore, the experimental comparisons with the baseline were made on the basis of the average accuracy, precision, and recall. Appendix A shows the performance evaluation of all tested models for each class, in addition to the baseline.

Level C (the lowest level of analysis)
At this level, the results of the proposed methodology for the joint application of resampling and feature selection analyzed in Level B were compared with the results obtained only through the use of a single technique and from the original imbalanced training set analyzed in Level A. The results of all the previous experiments of Levels A and B versus the baseline model are summarized in Figure 12. The results obtained from this figure can be summarized as follows: -The results showed the superiority of the proposed methodology, involving the joint application of resampling and feature selection with data balanced by the ROS and SMOTE methods, compared to the results obtained only through the use of a single technique and from the original imbalanced training set analyzed in Level A.
-By comparing the results obtained from Figures 10 and 12, the most accurate model (embedded-RF feature selection and ROS) achieved an accuracy of 57.3%, precision of 56.4%, and recall of 57.4%. The model built using RFE feature selection and ROS had comparable accuracy and precision of 57.2% each, and a recall of 56.3%.
-The proposed methodology improved prediction accuracy compared to baseline. Moreover, it produced a drastic reduction in the number of features, from 29 to 10 on average. This means these features, at least in a statistical sense, are the most influential factors for predicting player position.
-Based on the model that achieved the highest accuracy (embedded-RF feature selection and ROS) and Table 8, the most important attributes in characterizing a player's position are crossing, dribbling, finishing, heading accuracy, interceptions, long passing, marking, positioning, sliding tackle, standing tackle, strength, and vision.
-This model could be used as an initial model for characterizing football players according to the multivariate performance data; this information can be beneficial to coaches since it can be used as an objective criterion for evaluating a player.
Figure 10. The results of the approaches that combine resampling and feature selection models (in terms of accuracy, precision, and recall).
Figure 11. The results of the approaches that combine resampling and feature selection models (based on the average accuracy, precision, and recall for the nine feature selection methods).

Research Question 1
To answer Research Question 1 (Can machine learning algorithms make recommendations to improve team performance?), we implemented the baseline algorithm using an RF classifier to characterize football players' positions and evaluated its performance on our data. Since the data used in the study were unbalanced, the accuracy of the model did not exceed 37%.

Research Question 2
To answer Research Question 2 (Can data mining techniques improve the performance of machine learning algorithms?), we examined the importance of applying two preprocessing techniques, resampling and feature selection, to jointly reduce the complexity of the training datasets and solve the class imbalance problem by making empirical comparisons; a total of 40 predictive models were tested. The proposed methodology for the study consisted of three main steps. The first step consisted of applying the sampling technique to deal with class imbalance; the second step consisted of the feature selection technique, which dealt with the high dimensionality problem; and the third step combined feature selection and data sampling to deal with both issues.
Our approach goes beyond the studies presented in Table 2. We offer a comprehensive study in which we use nine selection algorithms drawn from the main feature selection approaches (filter, wrapper, and embedded), in addition to three methods for data balancing. We trained models using the RF as the objective function for each position. Based on the experiments, we concluded that 1) feature selection techniques did not improve the accuracy of the baseline model, 2) balancing techniques improved accuracy compared to the baseline, and 3) the results showed the superiority of the proposed methodology, involving the joint application of resampling and feature selection with data balanced by ROS and SMOTE, compared to the results obtained only through the use of a single technique and from the original imbalanced training set.
Overall, the proposed methodology improved the prediction accuracy compared to the baseline, and an accuracy of more than 57% was reported. Moreover, the proposed methodology provided a significant decrease in the number of features, from 29 to 10 on average. This means these features, at least in a statistical sense, are the most influential for predicting player position. This information can be beneficial to coaches since these features can be used as an objective criterion for evaluating a player. Moreover, this model could be used as an initial model for characterizing football players according to the multivariate performance data.
On the other hand, regarding player position, our approach goes beyond the studies presented in Table 1, which were limited to classifying players into the three central positions (defender, midfielder, and attacker). In contrast, we sought to find the specific role in those positions (e.g., centre midfielder or central attacking midfielder).
This study supports the concept that specific performance indicators define each position of players in football. Additionally, we believe that the quantitative analysis of the multivariate performance data using machine learning methods (like classification) is an essential step in this process.
Finally, our study has shown that the data collected from video games such as FIFA could improve prediction quality. Furthermore, these games can also be used as an essential source for retrieving sports data and executing artificial intelligence analyses.