Implementation of Feature Selection Strategies to Enhance Classification Using XGBoost and Decision Tree

Purpose: Grades are often the benchmark by which students are judged successful during their studies. Schools that provide the same facilities and teaching staff to all students still do not obtain uniform grades, and a gap in grades is found in every school. The purpose of this research is to improve classification accuracy by applying the feature selection methods Information Gain (IG), Recursive Feature Elimination (RFE), Lasso, and Hybrid (RFE + Mutual Information) with XGBoost and Decision Tree models. Methods: This research used data on 649 Portuguese course students. The data were pre-processed according to data requirements, feature selection was carried out to select the features that affect the target, the data were classified using XGBoost and Decision Tree, and the results were evaluated and reported. Results: Information Gain combined with the XGBoost algorithm gave the best accuracy of all combinations, 81.53%. Novelty: The contribution of this research is to improve on the classification accuracy of previous research by using two traditional machine learning algorithms and several feature selection methods.


INTRODUCTION
The success of a country is often gauged by indicators such as education and the economy. Despite their significance, these factors present challenges, especially in developing nations, where education is crucial for addressing issues like poverty and has transformative power over individuals, societies, and nations [1]. Recognizing the pivotal role of education, the Indonesian government introduced the 12-year mandatory education program (WAJAR) to improve equal access to quality education [2]. While these initiatives are commendable, grading remains a universal criterion for advancing to higher education [3].
Various factors, particularly those originating from students, can influence academic performance [4]. Acknowledging the importance of academic monitoring, educators must actively manage it to improve quality and performance [5]. Early identification of potential academic challenges requires teachers to discern the factors contributing to students' struggles [6]. Technological advancements facilitate the swift identification of these factors using data mining [7]. Educational Data Mining (EDM) is a burgeoning field that enables researchers to extract valuable insights or patterns from extensive datasets, minimizing decision-making risks in education [8]-[12]. Although relatively new, EDM has been widely used because it helps educators and institutions apply data-driven insights to make operational processes more efficient and to extract new knowledge from large student data sets [13], [14]. Classification using wrapper-type feature selection with several algorithms showed that Random Forest combined with wrapper feature selection had the highest accuracy, 77.20%, compared with J48 and Naive Bayes at 74.88% and 72.57%, respectively [16].
In addition, some researchers only look for the variables that most influence student academic grades. For example, S. Rajendran et al. [17] focused on identifying parameters that affect students' academic grades. They found that a health-conscious lifestyle and stress were positively correlated with academic performance. To determine the influential parameters, the study used feature importance from several machine learning algorithms, namely ANN, random forest, gradient boosting, and stacking. Lifestyle factors (physical activity and optimistic thinking) were the main features, with a relative feature value of more than 75 in influencing academic performance. Stress was also a significant feature, especially in gradient boosting and stacking, with relative feature values of almost 100. Fernandes et al. [18] conducted similar research to determine the factors that influence student academics using the feature importance of the Gradient Boosting Machine (GBM) algorithm. Taking the features with an importance of more than 0.35, they found that grades, attendance, environment, school, and age are indicators with the potential to determine academic achievement. The two studies used different data. Nevertheless, they show that many features, both internal and external to the individual student, affect academic performance. Feature selection is therefore an important step in data processing: although data may have many features, not all of them are important. Some features can be removed to simplify the analysis without reducing classification accuracy.
Feature selection itself is not new in data processing. For example, Fathania Firwan F et al. [19], in research on classification approaches for heart disease prediction, describe several categories of feature selection methods, namely Filter, Wrapper, Embedded, and Hybrid, each with its own advantages and disadvantages. Feature selection is a technique for finding a subset of the original feature set that efficiently represents the input data, reducing the impact of noise and irrelevant features while still giving good results for the task and helping analysts obtain good classification performance [20]. Feature selection is also important because it reduces the number of parameters, reduces training time, and minimizes over-fitting problems [21].
Although the studies above have made valuable contributions to the classification of the Portuguese student dataset, there is a significant research gap. Previous research considered only one type of feature selection, namely wrapper, without exploring the potential benefits of other feature selection methods. Therefore, this study attempts to fill this gap by using four types of feature selection (filter, wrapper, embedded, and hybrid) with XGBoost and Decision Tree models, because in tabular data classification Decision Tree remains the best choice in terms of performance and training time. According to S. Fayaz et al. and V. Borisov et al. [22], [23], Gradient Boosted Decision Tree (GBDT) models still dominate and show superior performance on tabular data, and the XGBoost algorithm is considered a recommended choice for real-life tabular data problems. In this context, the uniqueness of our research lies in combining correlation-based and importance-based feature selection methods. The most relevant features are selected to improve the accuracy of grade classification by prioritizing features that have a significant impact and by using well-suited classification models, namely Decision Tree and XGBoost. This approach not only improves prediction accuracy but also results in a more efficient model for handling tabular data. Therefore, this research is expected to contribute to a practical understanding of how various feature selection methods affect the performance of EDM classification models, thus enhancing the quality of education through more effective planning [24].

METHODS
The research process is illustrated in Figure 1 and encompasses dataset preprocessing, data splitting, feature selection, and modeling. Each step is elaborated in the following subsections.

Preprocessing data
Preprocessing plays a crucial role in data processing, addressing issues like noise, outliers, and irrelevant attributes in raw data. Data cleaning, particularly handling outliers, is a pivotal step. The Winsorize technique is employed in this dataset for outlier handling, replacing extreme values with the nearest values [25], [26]. Binning of G3 data follows the Portuguese higher education system, categorizing it into the ranges 0-9, 10-13, 14-15, 16-17, 18-19, and 20 [14]. Dataset splitting divides the data into training (80%) and testing (20%) subsets. Data balancing through SMOTE overcomes imbalance issues effectively by generating synthetic minority class samples [27]. Feature selection aims to obtain a minimal, informative subset, excluding irrelevant or highly correlated features [28]. The process involves selecting the k best features, ranging from 5 to 15 features with an increment of 5, denoted as k ∈ {5, 10, 15} [29]. Finally, data normalization, utilizing Z-Score transformation, ensures consistent ranges between data points [27].
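Three of the preprocessing steps above (G3 binning, Winsorizing, and Z-Score normalization) can be sketched in plain Python. This is a minimal illustration, not the code used in the study: the function names, the 10% Winsorizing limit, and the sample `absences` values are assumptions made for the example, and the bin labels are simply the range strings quoted in the text.

```python
from statistics import mean, pstdev

# Grade ranges quoted in the text; each label is just the range as a string.
G3_BINS = [(0, 9), (10, 13), (14, 15), (16, 17), (18, 19), (20, 20)]

def bin_g3(grade):
    """Map a final grade (0-20) to its range label, e.g. 12 -> '10-13'."""
    for low, high in G3_BINS:
        if low <= grade <= high:
            return str(low) if low == high else f"{low}-{high}"
    raise ValueError(f"grade out of range: {grade}")

def winsorize(values, limit=0.05):
    """Replace the lowest and highest `limit` fraction with the nearest kept value."""
    ordered = sorted(values)
    k = int(limit * len(ordered))  # number of values clipped on each side
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(v, low), high) for v in values]

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up feature values with one extreme outlier (assumption for illustration).
absences = [0, 2, 4, 4, 6, 8, 10, 12, 14, 75]
clipped = winsorize(absences, limit=0.10)
scaled = z_score(clipped)
print(bin_g3(12), clipped)
```

In this sketch the outlier 75 is pulled down to 14, the largest value that is kept, which is the "nearest value" replacement the text describes.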

Feature selection
Feature selection methods fall into several categories: filter, wrapper, embedded, and hybrid. In this research, all categories are used to find out which feature selection method is most suitable for the dataset. The filter category selects variables based on rank: a variable that falls below the threshold is eliminated, so that only the relevant features remain. Here, filtering uses the Information Gain (IG) algorithm, which discards features with a small IG value because they have little influence on accuracy and are arguably irrelevant. IG also has advantages such as increasing effectiveness and accuracy and reducing complexity [19], [30], [31]. For the wrapper category, Recursive Feature Elimination (RFE) is a feature selection method that aims to identify the most suitable feature subset by using a learned model and its classification accuracy. RFE is a wrapper method because it is supervised and wraps a model, iteratively removing the worst features with respect to the target [28], [32]. For the embedded category, Lasso selects the features that have non-zero coefficients after shrinkage is applied, while those with coefficients of exactly zero are excluded. The tuning parameter, also known as the regularization parameter (λ), controls the degree of regularization. Lasso has several benefits, such as helping to prevent overfitting, which results in better generalization, and improving interpretability by removing irrelevant features [33], [34]. Finally, hybrid methods combine several feature selection approaches. This study used a combination of Mutual Information and RFE.
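The filter approach can be illustrated with a small sketch that ranks categorical features by Information Gain and keeps the top k. It uses the textbook definition of IG (entropy of the target minus the weighted entropy after splitting on the feature); the function names and the toy data are assumptions for illustration, not the study's implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_k_by_ig(columns, labels, k):
    """Return the names of the k features with the highest Information Gain."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (assumption): 'failures' separates the classes, 'freetime' does not.
labels = ["pass", "pass", "fail", "fail"]
columns = {"failures": [0, 0, 1, 1], "freetime": [1, 2, 1, 2]}
print(top_k_by_ig(columns, labels, k=1))
```

Continuous features would need to be discretized first, or scored with an estimator of mutual information instead of this count-based version.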

XGBoost algorithm
XGBoost stands for Extreme Gradient Boosting, which completes the learning task by iteratively combining several weak learning models into a strong learning model [35], [36]. It is an ensemble algorithm based on GBDT (Gradient Boosted Decision Tree). The premise of boosting is that multiple decision trees perform better than a single one. Any single decision tree may perform poorly, but when multiple trees are combined, performance improves.
In this experiment, the input data consists of the final attributes selected in the feature selection step. The XGBoost formulas are given in equations (1) and (2).
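$$\mathcal{L} = \sum_{i} l\bigl(y_i, F(x_i)\bigr) \tag{1}$$

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2} \tag{2}$$

where $T$ is the number of leaves on the tree and $w$ is the output score of the leaves. A higher $\gamma$ value will result in a simpler tree. The value $\gamma$ controls the minimum loss reduction gain required to divide the internal nodes [37].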
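The role of the regularization parameters can be made concrete with a short numeric sketch. The penalty function follows the usual XGBoost regularization term (γ times the number of leaves plus half of λ times the sum of squared leaf scores). The split-gain expression is taken from the standard XGBoost derivation rather than from this text, and the function names and input numbers are assumptions chosen for illustration.

```python
def leaf_penalty(leaf_weights, gamma, lam):
    """Regularization term: gamma * T + 0.5 * lam * sum of squared leaf scores."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def split_gain(g_left, h_left, g_right, h_right, gamma, lam):
    """Loss reduction from a split, given summed gradients (g) and Hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Two leaves with scores +1 and -1 (made-up values).
penalty = leaf_penalty([1.0, -1.0], gamma=0.5, lam=2.0)

# The same candidate split evaluated with a small and a large gamma.
gain_small_gamma = split_gain(4.0, 3.0, -4.0, 3.0, gamma=0.0, lam=1.0)
gain_large_gamma = split_gain(4.0, 3.0, -4.0, 3.0, gamma=5.0, lam=1.0)
print(penalty, gain_small_gamma, gain_large_gamma)
```

With the larger γ the gain of the same split becomes negative, so the split would not be made. This is how a higher γ leads to a simpler tree.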

Decision tree algorithm
Decision tree is one of the popular and effective algorithms in data mining, especially for classification problems. In the context of education, a Decision Tree classifier can help identify the patterns and factors that contribute to student grades, and so support better data analysis and decision-making. To build a decision tree, the entropy of each attribute is calculated first, and then the information gain, as in formulas (3) to (5).
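The formulas referred to above are not reproduced in this text. As a reference point, the standard textbook definitions of entropy and information gain used by ID3-style decision trees are sketched below. The symbols $S$ (the set of samples), $A$ (an attribute), $p_c$ (the proportion of class $c$ in $S$), and $S_v$ (the subset of $S$ where $A = v$) are assumptions of this sketch and may differ from the notation of the original formulas.

```latex
\begin{align*}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
H(S \mid A) &= \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{Gain}(S, A) &= H(S) - H(S \mid A)
\end{align*}
```

At each node, the attribute with the highest gain is chosen as the split, and the procedure is repeated on each resulting subset.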

Evaluation
In assessing model performance, a commonly used tool is the confusion matrix. The confusion matrix gives a detailed breakdown of predicted versus actual classifications and forms the foundation for further evaluation. The key metrics used here are precision, recall, and accuracy. Precision is the ratio of true positive predictions to all positive predictions. The formula for calculating the precision value is in equation (6) [38].

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

Recall is the ratio of true positive predictions to the sum of true positives and false negatives. The formula for calculating the recall value is in equation (7).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

Accuracy is the ratio of correctly predicted data to all data, using the formula in equation (8) [39].

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
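The three metrics can be computed directly from lists of true and predicted labels, as in the minimal sketch below. Because grade classification has more than two classes, precision and recall are computed here for one class at a time, treated as the positive class. The function names and the toy labels are assumptions for illustration.

```python
def binary_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels using grade ranges as class names (made-up data).
y_true = ["0-9", "0-9", "10-13", "10-13", "10-13", "14-15"]
y_pred = ["0-9", "10-13", "10-13", "10-13", "14-15", "14-15"]

tp, fp, fn, tn = binary_counts(y_true, y_pred, positive="10-13")
print(precision(tp, fp), recall(tp, fn), accuracy(y_true, y_pred))
```

Averaging the per-class precision and recall values (for example, an unweighted macro average) gives a single figure per model. Which averaging the study used is not stated in the text.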

RESULTS AND DISCUSSIONS
The experimental analysis conducted involved evaluating the performance of XGBoost and Decision Tree models by considering four feature selection methods (IG, RFE, Lasso, and Hybrid) while varying the value of K as the number of top features selected (K=5, 10, 15).The graph below provides visual insights into how the accuracy of the models evolves with changes in the value of K.
Figure 2. Relationship between the number of features selected and accuracy

Figure 2 reflects differences in the models' performance as the value of K varies. Increases in accuracy are noticeable at specific values of K, indicating that proper feature selection can contribute positively to the model's performance. After thorough analysis, the results indicate that the optimal values of K differ for both models and each feature selection method, leading to unique characteristics and varying key metric values. Consequently, Table 1 is presented to provide a detailed breakdown of the highest precision, recall, and accuracy for each model and feature selection method. Table 1 provides a comprehensive overview of the highest performance for each model with the selected optimal values of K, aiming to enhance classification performance by choosing attributes aligned with the target and reducing complexity. Notably, the XGBoost algorithm, when coupled with Information Gain, attains the highest accuracy at 81.53%. This success can be attributed to several factors. Firstly, XGBoost is adept at comprehending intricate data relationships through the amalgamation of numerous small learners. Its design prevents overfitting to training data, enhancing its generalization to new data [40], [41].
Information Gain further refines the model by selecting pivotal features and directing its attention to essential components [42]. The ability of XGBoost to handle interrelated features and complex patterns contributes significantly to its robust performance across diverse data types. The synergy between XGBoost and Information Gain renders the model both resilient and accurate. Complementary to this, Table 2 outlines the features selected through the feature selection method, as a basis for reducing redundancy. To reinforce the reliability of our findings, it is worth mentioning that prior research on heart disease datasets employed the same methodology, XGBoost with IG, resulting in an accuracy of 93.44%, despite the distinct nature of the datasets [40]. This consistency underscores the reliability and robustness of the XGBoost with IG approach. The effectiveness of XGBoost with IG for the classification of student grades offers practical implications for educational support, supported by the method's consistent success across diverse datasets.

Figure 3. Comparison accuracy
The effectiveness of XGBoost with Information Gain (IG) in the classification of students' grades in Table 3 and Figure 3 presents practical implications for educational support, bolstered by the consistent success of the method across diverse datasets.Moreover, the integration of sophisticated techniques like these into educational practices has the potential to significantly improve the quality of education.By harnessing insights provided by XGBoost and Information Gain, educators can adapt their teaching strategies, identify at-risk students, and implement targeted interventions, thereby fostering a more effective and personalized learning environment.The application of this methodology aligns with the common trend of leveraging data-driven approaches to enhance educational outcomes, underscoring the importance of embracing technological advancements for educational improvement.

CONCLUSION
In response to the challenges of assessment in education, this research focused on the classification of Portuguese final grades (G3).The results confirmed that using the XGBoost algorithm with Information Gain (IG) feature selection provided the best performance with an accuracy of 81.53%.The implication is that this grade classification system can effectively help teachers analyze students who need special attention before exams to achieve optimal results.For future research, it is recommended to consider using the latest datasets and explore deep learning methods, such as neural networks, to improve accuracy with the ability to capture more complex patterns.

Table 1. The results of XGBoost and Decision Tree before and after using feature selection

Table 2. Feature selection

The feature selection results in Table 2 indicate that the significant attributes for the classification of final grades involve variables such as age, Medu, Fedu, Mjob, studytime, failures, freetime, goout, as well as the grades G1 and G2. This step is crucial in the context of accuracy comparison with previous research, as demonstrated in Table 3 and Figure 3.

Table 3. Comparison of research results with other studies

In Table 3, the implementation of XGBoost and Information Gain (IG) on the student dataset demonstrated a notable accuracy of 81.53%, surpassing other methods, including Random Forest with Wrapper feature selection. This superiority is attributed to XGBoost's inherent capability to capture complex patterns and the pivotal feature selection by Information Gain, as discussed earlier. It is crucial to note that the Random Forest method in the comparison table was also applied to the same dataset.