Effect of Hyperparameter Tuning Using Random Search on Tree-Based Classification Algorithm for Software Defect Prediction



INTRODUCTION
The influence and complexity of software continue to expand across diverse domains of our lives. As software becomes more intricate, the process of rectifying failures or defects becomes increasingly challenging [1]. Software failures can be anticipated by predicting software defects in the early stages of development, since defects are harder to identify and more costly to rectify in later stages. Software defects are flaws, inaccuracies, or malfunctions within a computer system or program that may lead to unexpected or erroneous outcomes, impeding the intended functionality of the software and reducing its quality. This decline in quality is itself a disadvantage, so high-quality software must exhibit minimal defects. Early identification of software defects can curtail development expenses and rework effort and yield more dependable software [2]. Predicting software defects is therefore of great significance for addressing software issues while enhancing software quality [3]. Defect prediction involves analyzing software metrics and then constructing prediction models from them. Defects in software modules are identified through classification, a method employed by numerous studies [4]. Using metrics to predict software defects is pivotal in developing predictive models that enhance software quality by anticipating as many software failures as possible.
Andini et al. applied Grid Search hyperparameter tuning to tree-based classification. The Decision Tree obtained an average AUC value of 0.69, Random Forest generated an average AUC value of 0.76, and Deep Forest produced an average AUC value of 0.79 [5].
In a second study, Afrizal et al. used Random Search hyperparameter tuning to improve the performance of software defect prediction. According to the study's results, Random Search was useful for the parameter search problem. Without hyperparameter tuning, the XGBoost classifier obtained an accuracy of 95.34%, a recall of 93.78%, and a precision of 95.63%. With hyperparameter tuning, XGBoost achieved an accuracy of 95.34%, a recall of 95.63%, and a precision of 98.44%. Using Random Search for hyperparameter tuning in XGBoost was reported to yield an estimated 2.35% improvement in accuracy, a 2.55% rise in recall, and a 2.81% increase in precision [6].
In another study, Zhou et al. evaluated DPDF alongside several other approaches for software defect prediction, including Random Forest (RF), Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), and Deep Belief Networks (DBN). The NASA, PROMISE, AEEEM, and ReLink datasets were all used. The comparison showed that DPDF performed best on the NASA datasets, reaching a maximum AUC of 92%. DPDF also outperformed the other approaches on the PROMISE and AEEEM datasets, with scores of 89% and 86%, respectively. However, on several ReLink datasets DPDF did not outperform RF and DBN, reaching a maximum score of 75% [7].
Motivated by the performance improvements reported in previous research, this study applies hyperparameter tuning to the prediction of software defects. Specifically, Random Search is applied to tree-based classifiers: Decision Tree, Random Forest, and Deep Forest.

METHODS
This section describes the datasets used in this study and the preprocessing techniques employed to prepare the data for analysis. It also covers the classification algorithms used, namely Decision Tree, Random Forest, and Deep Forest. Additionally, it explains the validation conducted using cross-validation, which ensures the robustness of the results. Moreover, it describes the hyperparameter search conducted using Random Search, which aims to optimize the performance of the classification algorithms. Lastly, it discusses the performance measure employed, namely AUC. Figure 1 depicts the flow of this study and serves as a reference for the stages that follow.

Figure 1 Research flow
The first stage of this research was to assemble the ReLink datasets: Apache, Safe, and Zxing. The datasets were then preprocessed using label encoding and z-score normalization. Next, the data was split by cross-validation, with 10-fold cross-validation as the validation approach. Each ReLink dataset was separated into ten parts; in each round, nine parts were used for training and the remaining one for testing. An optimal hyperparameter search was then performed using Random Search prior to the learning phase. The learning phase comprised three scenarios: classification using a Decision Tree, classification using a Random Forest, and classification using a Deep Forest. The mean AUC was used to evaluate the results.

Data Collection
The study employed a software metric dataset called ReLink, composed of Apache, Safe, and Zxing data.This dataset is accessible for download at the subsequent hyperlink: https://github.com/bharlow058/AEEEM-and-other-SDP-datasets/tree/master/dataset/Relink.

Preprocessing
Before the data is split, it is adjusted to meet the requirements of the algorithms. Preprocessing tailors the data to the classification algorithms, which can enhance the performance of the classification models [5]. The preprocessing in this study comprises label encoding and normalization. Label encoding transforms label values into a numeric representation [8]. So that each model can handle the class labels, the label values in the ReLink dataset are replaced with their numeric equivalents. Normalization is a preprocessing procedure in which numerical attribute data is rescaled so that its values fall within a specific range [9]. It serves as a technique for mapping data measured on different scales onto a common one. Among the available normalization techniques, this study uses the z-score method, given in Equation (1) [5].

$$z = \frac{X - \mathrm{Mean}}{\mathrm{Std}} \tag{1}$$

where X is the observed value (the original data), Mean is the average value, and Std is the standard deviation. The z-score is positive when the value exceeds the average and negative when it falls below the average. Because the range of values in the ReLink dataset differs for each feature, normalization is necessary.
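The two preprocessing steps can be illustrated with a minimal sketch. It uses only the Python standard library; the label strings, the toy feature values, and the choice of the population standard deviation are assumptions made for illustration and are not taken from the ReLink data or the study's implementation.

```python
from statistics import mean, pstdev

def encode_labels(labels, positive="defective"):
    """Map class labels to integers: 1 for the defective class, 0 otherwise."""
    return [1 if label == positive else 0 for label in labels]

def z_score(values):
    """Apply z = (X - Mean) / Std to one feature column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in values]

# Hypothetical toy data, not taken from ReLink.
labels = ["clean", "defective", "clean", "defective"]
lines_of_code = [10, 20, 30, 40]

encoded = encode_labels(labels)
scaled = z_score(lines_of_code)
```

Values above the column mean (here 30 and 40) come out positive and values below it negative, and the scaled column has mean 0 and standard deviation 1. In practice a library routine such as scikit-learn's `StandardScaler` would typically be fitted on the training folds only.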

Cross Validation
Cross-validation is introduced to reduce the overfitting that can arise from a single random split of the dataset. Cross-validation, a popular machine learning approach, repeatedly divides the original dataset into training and test data. The aim is to ensure successful training and reliable assessment of the classification models. The data is divided into K subsets (folds), with K commonly set to 10. K-fold cross-validation is a cyclic procedure repeated for each fold: each fold is used exactly once as the validation set, with the remaining folds used as training data. The drawback of plain K-fold cross-validation is that the split does not preserve class proportions, so some folds may contain few or no samples of the minority class, especially when the data is imbalanced [10]. To prevent errors caused by imbalanced data, we use stratified K-fold (SKF) cross-validation (CV), which distributes the classes evenly across the folds. Stratified 10-fold cross-validation ensures that each fold has approximately the same class distribution and that all instances are tested [11].
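The idea behind stratified splitting can be sketched in a few lines. The function name `stratified_k_fold`, the round-robin assignment, and the toy label list are assumptions made for illustration; this is a simplified stand-in, not the study's implementation.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds keep roughly the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx

# Hypothetical imbalanced labels: 80 clean (0) and 20 defective (1) modules.
labels = [0] * 80 + [1] * 20
splits = list(stratified_k_fold(labels, k=10))
```

With these toy labels every test fold holds 8 clean and 2 defective modules, matching the 80/20 ratio of the full set, and each module appears in exactly one test fold.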

Learning and Hyperparameter Tuning
The Decision Tree (DT) is a widely used and successful approach that finds applications in several disciplines, including machine learning, classification, image processing, and pattern detection. The model's output is determined by sequentially traversing a tree structure consisting of decision nodes [12]. The primary objective of the DT is to construct a model capable of predicting the value of a target variable based on decision rules derived from training data. As its name suggests, the Decision Tree is a hierarchical structure consisting of three types of nodes. The root node is the starting point of the tree, the internal nodes serve as branching points, and the leaf nodes, which are the terminal nodes, carry the class labels. Classification proceeds by splitting the tree into branches, where each split corresponds to a test on a specific attribute. Branching continues until the terminal level is reached, at which point the data tuples of each node comprise samples belonging to a single class. The algorithm ends the partitioning once the training tuples at each node belong exclusively to one class. Finally, the leaf nodes provide the class predictions. The configuration of the Decision Tree model is depicted in Figure 2.

Deep forests consist of layer-by-layer structures known as cascade forests. The structure of each layer in the cascade forest resembles the layer-by-layer processing of deep neural networks (DNNs), with the distinction that each layer contains multiple Random Forests instead of neurons [3]. In a cascade forest, each tree generates a class distribution for every instance. These distributions are calculated from the ratios of the different classes among the instances. A class vector is then obtained by averaging the class distributions across all the trees and forests. The deep forest algorithm processes the data layer by layer. The first layer receives the attributes or features of the original dataset as input, and its output is then processed by the Random Forests in the subsequent layer. The cascade terminates either when the output of the Random Forests no longer improves or when the result on a given layer decreases. The algorithm passes the results from layer to layer until the last layer is reached and the maximum value is obtained. Despite taking longer than Random Forest, Deep Forest performs better when dealing with small-scale data [13]. The flow of the cascade forest is shown in Figure 4.
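The class-vector step can be illustrated with a small sketch. The per-tree distributions and feature values are made up, the function names are hypothetical, and passing the original features on to the next layer is an assumption borrowed from the usual description of cascade forests rather than something stated in the text above.

```python
def forest_class_vector(tree_distributions):
    """Average the per-tree class distributions produced for one instance."""
    n_trees = len(tree_distributions)
    n_classes = len(tree_distributions[0])
    return [sum(dist[c] for dist in tree_distributions) / n_trees
            for c in range(n_classes)]

def next_layer_input(original_features, forest_outputs):
    """Concatenate the original features with every forest's class vector.

    Forwarding the original features is an assumption (see the lead-in).
    """
    augmented = list(original_features)
    for tree_distributions in forest_outputs:
        augmented.extend(forest_class_vector(tree_distributions))
    return augmented

# Hypothetical outputs for one module: two forests with three trees each,
# every tree giving [P(clean), P(defective)].
forest_a = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
forest_b = [[0.25, 0.75], [0.25, 0.75], [0.25, 0.75]]
features = [0.5, -1.0]

layer_input = next_layer_input(features, [forest_a, forest_b])
```

Here the first forest contributes the class vector (0.5, 0.5) and the second (0.25, 0.75), so the next layer would see six values for this module: the two original features followed by the two class vectors.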

Figure 4 Deep Forest Structure
Hyperparameter tuning is the search for the best values of a set of hyperparameters, for which one must first specify the list of parameters and their corresponding search ranges [14]. Parameters that are not tuned can be left at their default values. DT, RF, and DF each have a set of hyperparameters that can be configured. For the Decision Tree, the parameter min_samples_leaf determines the minimum number of samples required at a leaf node, and min_samples_split controls the minimum number of samples required to split an internal node.
The parameter max_depth determines the maximum depth of the tree, while min_impurity_decrease regulates the growth of the tree based on impurity, which is assessed through metrics such as the Gini index and entropy [15]. Like the Decision Tree, RF has a max_depth parameter that regulates the depth of the trees within the forest [16]. Additionally, RF and DF have a parameter called n_estimators, but its role differs between the two models. In DF, this parameter governs the number of forests in each layer. In RF, it dictates the number of trees within the forest [8].
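To make the notion of a search range concrete, the following sketch runs a Random Search over Decision Tree hyperparameters of the kind listed above. The ranges in `SEARCH_SPACE` are hypothetical placeholders (the study's actual ranges are in its Table 4), and `toy_score` is a stand-in for a cross-validated AUC; only the sample-evaluate-compare loop is the point.

```python
import random

# Hypothetical search ranges, chosen only for illustration.
SEARCH_SPACE = {
    "max_depth": range(1, 11),
    "min_samples_split": range(2, 11),
    "min_samples_leaf": range(1, 11),
}

def sample_config(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(list(values)) for name, values in space.items()}

def random_search(score_fn, space, n_iter=30, seed=0):
    """Score n_iter random configurations and return the best one."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        config = sample_config(space, rng)
        candidates.append((score_fn(config), config))
    best_score, best_config = max(candidates, key=lambda pair: pair[0])
    return best_config, best_score, candidates

def toy_score(config):
    """Stand-in for a cross-validated AUC; peaks at max_depth == 5."""
    return 1.0 - abs(config["max_depth"] - 5) / 10

best_config, best_score, candidates = random_search(toy_score, SEARCH_SPACE)
```

Unlike Grid Search, which evaluates every combination in the grid, the number of evaluations here is fixed by `n_iter` regardless of how many values each range contains. With scikit-learn, the same loop is provided by `RandomizedSearchCV`.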
Random Search begins by randomly selecting a combination of hyperparameter values, which is then used to train the model. The training result is recorded and validated. These steps are repeated many times to generate multiple candidates. The validation scores of all candidates are then compared, and the candidate with the highest score is selected as the optimal parameter configuration. Figure 5 depicts the steps of Random Search.

Evaluation
The classification performance of the DT, RF, and DF models on each ReLink dataset is evaluated using AUC values. AUC was selected as the evaluation method for its appropriateness in assessing predictive performance on datasets with imbalanced classes [5]. Table 2 provides guidelines for the classification of AUC values.
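One way to see why AUC tolerates class imbalance is to compute it directly from its ranking interpretation: the probability that a randomly chosen defective module receives a higher score than a randomly chosen clean one. The sketch below does this with a simple pairwise comparison; the function name and the toy predictions are assumptions for illustration.

```python
def auc_score(y_true, y_score):
    """AUC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for six modules (1 = defective).
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
auc = auc_score(y_true, y_score)
```

In this toy case 7 of the 8 positive-negative pairs are ordered correctly, giving an AUC of 0.875. Because the value depends only on how the two classes are ranked relative to each other, it does not change with the ratio of clean to defective modules. The pairwise loop is quadratic and meant only to show the definition; a library function such as scikit-learn's `roc_auc_score` would normally be used.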

RESULTS AND DISCUSSION
This section presents the research findings, beginning with the label encoding stage. The ReLink dataset contains 649 modules whose class labels have not yet been encoded as the models require. The label values in the ReLink dataset are therefore converted into binary numbers, specifically 0 for clean and 1 for defective. Following the label encoding stage, normalization is conducted using the z-score, as outlined in Equation (1). In this process, a value that exceeds the average receives a positive z-score, while a value below the average receives a negative z-score. The normalized data can be observed in Table 3, which displays the results of z-score normalization. The ReLink dataset was subsequently split into training and testing data using stratified 10-fold cross-validation. Following this, the learning (classification) process was carried out. Classification was performed using the Decision Tree, Random Forest, and Deep Forest algorithms with Random Search hyperparameter tuning. The classification procedure covered the entire ReLink dataset, consisting of the Apache, Safe, and Zxing subsets.
Random Search was conducted for 30 iterations, producing 30 hyperparameter candidates. The parameter ranges for each model are provided in Table 4. Random Search returns the parameters that give the best prediction performance, measured by the AUC value of each model. The optimal parameter values and the resulting AUC are shown in Table 5. The evaluation results on the Safe dataset, with the optimal parameters and AUC, are shown in Table 6. The evaluation results on the Zxing dataset, with the optimal parameters and AUC, are shown in Table 7.

The comparison with previous studies indicates that Random Search hyperparameter tuning, applied to all of the tree-based algorithms, improved the prediction of software defects. This improvement was found by comparing the average performance of each proposed approach with that of the method in the prior study. Compared with the decision tree tuned by Grid Search, Random Search showed an improvement of 0.04 in average AUC (4 percentage points). Furthermore, the proposed random forest approach with Random Search (RS) tuning outperformed Grid Search (GS) tuning by 0.03 (3 percentage points). The deep forest approach with Random Search tuning produced the same AUC value of 0.79 as the earlier researchers obtained with their previously investigated hyperparameters.
In a previous study, hyperparameters were adjusted using the grid search method.The leaf node parameters were set from 1 to 10, internal nodes from 2 to 10, tree depth from 1 to 10, and impurity from 0 to 3. The decision tree algorithm classification resulted in an average AUC value of 0.69.Furthermore, the classification using the random forest algorithm with grid search tuning had the parameters set from 100 to 500 for the number of trees and from 1 to 5 for the tree depth, resulting in an average AUC value of 0.76.Lastly, the deep forest classification with grid search tuning had the parameters set from 2 to 11 for the number of forests and from 100 to 2000 for the number of trees, yielding an average AUC value of 0.79.
In this study, hyperparameter tuning was performed using Random Search with the parameters outlined in Table 4. The decision tree classification algorithm achieved an average AUC value of 0.73, the random forest algorithm 0.79, and the deep forest algorithm likewise 0.79. A comparison of the AUC values is presented in Table 9.

This study focuses on predicting software defects in the ReLink datasets. The prediction is done using tree-based classification models, namely DT, RF, and DF, which are further enhanced by hyperparameter tuning using the Random Search technique. The performance of these models varies, as indicated by the results obtained from the trials and comparisons. Compared with previous studies that used Grid Search hyperparameter tuning for tree-based classification, Random Search improves the average AUC of DT and RF and matches it for DF.
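The size of these differences can be checked with a few lines of arithmetic. The sketch below uses only the average AUC values quoted in the text for Grid Search (previous study) and Random Search (this study); the variable names are illustrative.

```python
# Average AUC values reported in the text for each tuning method.
grid_search_auc = {"DT": 0.69, "RF": 0.76, "DF": 0.79}
random_search_auc = {"DT": 0.73, "RF": 0.79, "DF": 0.79}

# Absolute gain in AUC and relative gain in percent of the Grid Search value.
absolute_gain = {
    model: round(random_search_auc[model] - grid_search_auc[model], 2)
    for model in grid_search_auc
}
relative_gain = {
    model: round(100 * (random_search_auc[model] - grid_search_auc[model])
                 / grid_search_auc[model], 1)
    for model in grid_search_auc
}
```

The absolute gains are 0.04 for DT, 0.03 for RF, and 0.00 for DF. Expressed relative to the Grid Search baselines, these correspond to roughly 5.8% and 3.9% for DT and RF, so a figure such as "4%" is best read as percentage points of AUC rather than a relative change.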
Interestingly, the Random Search approach outperforms other studies that employed NB, LR, and SVM with default parameter configurations. Regarding RF and DF, the RF parameters mainly concern the number of trees within a specified range, typically from 100 to 1000. DF introduces an additional parameter, namely the number of forests, with possible values of 3, 4, and 5. In that study, the average AUC value achieved by the RF model was 0.72, while the DF model performed slightly better with an average AUC of 0.73.
The results of this research show that hyperparameter tuning with Random Search in combination with the RF classification model produces improved results, as is evident from the average AUC value of 0.79. This result supports the effectiveness of Random Search in the parameter search process compared with the commonly used Grid Search approach in tree-based classification.
Future investigations could expand the range of parameter candidates considered by Random Search. The objective would be to determine how far the parameter search can be improved and whether a wider search range yields better performance values, while also exploring the benefits and drawbacks of doing so.

Figure 2 Decision Tree Structure

Figure 3 Random Forest Structure

Figure 5 Stages of a Random Search

Effect of Hyperparameter Tuning Using Random Search on … (Muhammad Hevny Rizky)

Table 3 Z-score results

Table 4 Hyperparameter Search Ranges

Table 8 illustrates the mean AUC outcomes derived from hyperparameter tuning via Random Search across all ReLink datasets. The Random Forest model outperforms the Decision Tree model in average AUC on the ReLink datasets. Deep Forest achieved an average AUC value of 0.79, with its best performance on the Safe data at 0.86. Similarly, Random Forest had an average AUC of 0.79, with its highest performance on the Safe data at 0.86. In contrast, the Decision Tree model performed worse than the other models: it achieved an average AUC value of 0.73, with its highest performance on the Safe data reaching 0.83. Overall, Random Forest and Deep Forest deliver the best average AUC values.

Table 9 Comparison of AUC Values

Table 10 presents a comparison of the overall outcomes with previous research that employed other methods, including LR, DF, SVM, NB, and RF. The findings of this investigation surpass those of its predecessors. More specifically, the average AUC value achieved by the DT, RF, and DF classifiers with Random Search hyperparameter tuning exceeds the average AUC value obtained with the alternative methods.

Table 10 Comparison of AUC Results with Other Research Methods

CONCLUSIONS