BEANS CLASSIFICATION USING DECISION TREE AND RANDOM FOREST WITH RANDOMIZED SEARCH HYPERPARAMETER TUNING

Dry-beans are a high-protein food. They can be processed into food products for emergency conditions such as famine, natural disasters, and war, and they keep for a long time. Identifying types of beans manually requires a lot of time and effort, so a computerized system that can classify beans is necessary. In this study, we classified beans using public data from Koklu. The data consists of 13,611 rows with sixteen features and seven classes. The number of rows per class is imbalanced, so the dataset is balanced using random oversampling. Classification uses the Decision Tree and Random Forest machine learning methods. In addition, hyperparameter tuning with randomized search is applied to the number of trees: 50, 75, 150, 200, and 300. The test results show that the Random Forest's accuracy, precision, recall, and f1-score all reach 0.9658. The best number of trees is 300.


INTRODUCTION
Beans are a plant product that can be used as processed food. Beans add nutrients to the daily menu: they are high in protein, B vitamins, minerals, and fiber. Beans can be used for emergency food programs during natural disasters, long dry seasons, fires, and war [1]-[3].
Globally, there are more than 1,300 species of beans, but only about 20 are consumed by humans.
Among these beans are dry-beans, which are low in fat, low in sodium, and do not contain cholesterol. Dry-beans are cheaper than animal food products. Also, if stored properly, the product can have a longer shelf life than animal, fruit, and vegetable products. Dry-bean plants can also fix atmospheric nitrogen in the soil [2]. Production and harvest area for dry-beans in 2020 is 27.5 metric tons and 34.8 hectares. Dry-beans production has increased by 60%, and harvested area has increased by 36% since 1990 [4].
Choosing the type of dry-beans as a processed food ingredient requires precision. Manual sorting demands sustained physical and visual effort. If the number of types of beans that must be identified is large, a computerized system is necessary. Computer vision is a field that can fulfill this role. Previous research applied computer vision to the classification and identification of types of beans using the Koklu public data. The total data is 13,611 grains with seven different types of beans. Data was split using 10-fold cross-validation. Classification used machine learning methods: Multi-layer Perceptron (MLP), Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Decision Tree (DT). The test results show that the accuracy is 0.9173, 0.9313, 0.8792, and 0.9252, respectively [5].
Other research using the Koklu dataset uses random undersampling. The machine learning classification methods include Logistic Regression, Random Forest, XGBoost, and CatBoost. Test results show the best accuracy using XGBoost with 0.938 [6].
Subsequent research used the same beans dataset with machine learning classification methods, including Multinomial Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest, Voting Classifier, and Artificial Neural Network. Experimental results show an accuracy between 0.8835 and 0.9361 [7]. Other research using k-Nearest Neighbor, Decision Tree, SVM, and MLP produces an accuracy of 0.9030, 0.9083, 0.9223, and 0.9249, respectively. The study used the same dataset from Koklu [8].
The results of previous research leave room for improvement in performance. For this reason, this research adds stages such as balancing the data for each class and hyperparameter tuning to optimize classification results.

METHODS
This research consists of several stages: Exploratory Data Analysis (EDA), preprocessing to balance the dataset, and classification using Decision Tree and Random Forest. In addition, optimization is carried out using randomized search. The complete steps are shown in Figure 1.

A. Input Dataset
The Koklu dry-beans data has 13,611 rows, 16 geometric features, and bean species labels. There are seven classes of dry-beans: Barbunya, Bombay, Cali, Dermason, Horoz, Seker, and Sira. Each species has a different amount of data. The amount of data in each class is shown in Table 1 [5]. The public data is imbalanced across classes. The class with the most data is Dermason with 3,546 rows, and the class with the least is Bombay with 522 rows.

B. Exploratory Data Analysis (EDA)
EDA aims to determine the characteristics of the data. This stage is carried out before modeling. Generally, EDA gives information about [9], [10]:
1. The total amount of data, the number of classes, the amount of data for each class, and the number of features.
2. Data type for each feature. The data type can be numeric or categorical.
3. Missing values. Do any features have null values?
4. Data duplication. How many duplicated rows exist? Duplicated rows are dropped.
5. Correlation between features. What is the degree of correlation between features? A high correlation indicates a close relationship between features.
6. Data outliers. Are there any outliers, that is, data whose values differ significantly from the other data?
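Several of the checks listed above can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library; the toy rows, the column meanings, and the helper names (`count_missing`, `drop_duplicates`, `pearson`) are assumptions made for illustration and are not taken from the study.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy rows: (area, perimeter, label). None would mark a missing value.
rows = [
    (100.0, 40.0, "SEKER"),
    (200.0, 80.0, "BOMBAY"),
    (100.0, 40.0, "SEKER"),  # exact duplicate of the first row
    (150.0, 61.0, "SIRA"),
]

def count_missing(rows):
    """Number of cells whose value is None (check 3)."""
    return sum(value is None for row in rows for value in row)

def drop_duplicates(rows):
    """Keep the first occurrence of every row, preserving order (check 4)."""
    return list(dict.fromkeys(rows))

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns (check 5)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

missing = count_missing(rows)
unique_rows = drop_duplicates(rows)
class_counts = Counter(label for *_, label in unique_rows)  # rows per class (check 1)
corr = pearson([r[0] for r in unique_rows], [r[1] for r in unique_rows])
```

On real tabular data these checks are usually done with a data-frame library; the sketch only makes the underlying operations explicit.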

C. Balanced dataset with Oversampling
The amount of data in each class of the beans dataset differs. The smallest class is Bombay, with 522 rows, while Dermason has 3,546 rows. A class with little data gives the model less to learn from, while a class with a large amount of data can be learned better. This causes an imbalance in learning between classes.
Classes with more data are recognized better, while classes with little data are recognized worse.
Therefore, it is necessary to balance the data between classes so the system can learn each class equally. Oversampling is a method to overcome class imbalance. Data in small classes is increased by randomly duplicating existing data [11]-[13]. Oversampling visualization is shown in Figure 2.
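Random oversampling as described here can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the study; the function name `random_oversample`, the fixed seed, and the toy data are assumptions.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing rows with replacement from the existing rows.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels

# Toy example: six majority rows and two minority rows.
samples = [(float(i), float(i) * 2) for i in range(8)]
labels = ["DERMASON"] * 6 + ["BOMBAY"] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
```

No new feature values are created: every added row is an exact copy of an existing minority-class row.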

D. Classification using Decision Tree and Random Forest (RF)
Decision Tree is a supervised learning method used for classification and regression. It is a hierarchical model that consists of a root node, branches, and leaf nodes. Splits are generally chosen using entropy and information gain, which determine the features placed at the root node and at the subsequent internal nodes. The commonly used Decision Tree algorithms are ID3, C4.5, and C5.0 [14], [15].
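To make the role of entropy and information gain concrete, the sketch below computes both for a candidate split. It assumes the standard Shannon entropy with base-2 logarithms; the function names and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, groups):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(parent) - weighted

parent = ["SEKER"] * 4 + ["SIRA"] * 4

# A split that separates the two classes perfectly.
gain_perfect = information_gain(parent, [["SEKER"] * 4, ["SIRA"] * 4])

# A split that leaves both children as mixed as the parent.
gain_useless = information_gain(
    parent, [["SEKER", "SEKER", "SIRA", "SIRA"], ["SEKER", "SEKER", "SIRA", "SIRA"]]
)
```

The feature and threshold with the highest information gain are chosen for a node, which is why the perfectly separating split would be preferred here.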
A random forest consists of multiple trees. Random Forest is a method that uses ensemble learning techniques. Ensembles combine various models. There are two types of ensemble: bagging and boosting. Bagging trains multiple models in parallel, and the final output is based on majority voting. Random Forest follows the bagging principle. The Random Forest algorithm can be described as follows [12], [16]-[18]:
1. Select a random sample from the provided dataset.
2. Create a Decision Tree for each selected sample. Each Decision Tree then produces a prediction.
3. A voting process is carried out over the prediction results. For classification problems, use the mode (the value that occurs most often).
4. The algorithm chooses the prediction result with the most votes as the final prediction.
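The four steps above can be sketched with a deliberately simplified base learner. In this illustration each "tree" is a one-split decision stump on a randomly chosen feature, used as a stand-in for a full Decision Tree; the function names, the toy data, and the seed are assumptions, and a production Random Forest differs in many details.

```python
import random
from collections import Counter

def fit_stump(data, rng):
    """Fit a one-split tree on one randomly chosen feature
    (a simplified stand-in for a full decision tree)."""
    feature = rng.randrange(len(data[0][0]))
    best = None
    for threshold in sorted({x[feature] for x, _ in data}):
        left = [y for x, y in data if x[feature] <= threshold]
        right = [y for x, y in data if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def fit_forest(data, n_trees, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # step 1: bootstrap sample
        trees.append(fit_stump(sample, rng))       # step 2: one tree per sample
    return trees

def predict(trees, x):
    votes = Counter(tree(x) for tree in trees)     # step 3: collect the votes
    return votes.most_common(1)[0][0]              # step 4: most frequent vote wins

# Toy data: (features, label) pairs for two well-separated classes.
data = [
    ((1.0, 1.2), "SEKER"), ((1.1, 0.9), "SEKER"), ((0.9, 1.0), "SEKER"),
    ((3.0, 3.1), "BOMBAY"), ((3.2, 2.9), "BOMBAY"), ((2.9, 3.0), "BOMBAY"),
]
trees = fit_forest(data, n_trees=25)
small_bean = predict(trees, (0.5, 0.5))
large_bean = predict(trees, (3.5, 3.5))
```

Because every tree sees a different bootstrap sample and a different feature, individual trees may disagree, and the vote smooths out their individual errors.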
RF has several characteristics. First, not all attributes/features/variables are used for each tree, so every tree is different. Second, the feature space is reduced because not all features are used in each tree.
Third, it works in parallel: each tree is created with different data and attributes. Fourth, RF has a built-in validation mechanism: each tree is trained on a bootstrap sample, so roughly one third of the data (the out-of-bag samples) is not used by that tree and can serve as test data for it. Fifth, it is stable because the results are based on majority voting or averaging [12].
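The share of data left out of each tree can be checked with a short calculation. The sketch below assumes that each tree's bootstrap sample is drawn with replacement and has the same size as the training set, which is the usual bagging setup but is not stated explicitly in this study; the function name is illustrative.

```python
from math import exp

def out_of_bag_fraction(n_samples):
    """Probability that a given row is never picked in n_samples
    draws with replacement from n_samples rows."""
    return (1 - 1 / n_samples) ** n_samples

# 9,480 is the training-set size reported in this study.
fraction = out_of_bag_fraction(9480)

# As n_samples grows, the fraction approaches 1/e.
limit = exp(-1)
```

Under this assumption the out-of-bag share is close to 1/e, a little above one third of the rows for each tree.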

E. Hyperparameter tuning in RF with randomized search
In machine learning, several optimizations can improve performance, one of which is hyperparameter tuning. Conventionally, the existing combinations of hyperparameter values are tried one by one. The initial hyperparameters were tested with varying values.
The hyperparameters in RF include the number of trees, maximum features/attributes/variables, minimum number of leaves, criterion (entropy/gini impurity/log loss), and maximum leaf nodes on each tree. Various combinations of hyperparameters were tested one by one. This requires significant resources if many combinations of hyperparameter values exist.
One solution to overcome this problem is randomized search (RS). The RS technique selects a combination of values for each hyperparameter randomly. So, unlike Grid Search, not all combinations of hyperparameter values are executed. Therefore, the resources required by the system are reduced because not all combinations of hyperparameter values are used [19]-[21].
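The difference between Grid Search and randomized search can be sketched as follows. This is a minimal standard-library illustration: only the `n_estimators` values come from this study, while the other hyperparameter values, the function names, and the placeholder `toy_score` function (which does not train any model) are assumptions. Libraries such as scikit-learn provide a full implementation as `RandomizedSearchCV`.

```python
import itertools
import random

# Hypothetical search space; only the n_estimators values come from this study.
search_space = {
    "n_estimators": [50, 75, 100, 150, 200, 300],
    "criterion": ["gini", "entropy", "log_loss"],
    "max_features": ["sqrt", "log2"],
}

def grid_candidates(space):
    """Every combination of hyperparameter values, as Grid Search would try."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

def randomized_search(space, score_fn, n_iter, seed=0):
    """Score only n_iter randomly chosen combinations and return the best one."""
    rng = random.Random(seed)
    candidates = rng.sample(grid_candidates(space), n_iter)
    best = max(candidates, key=score_fn)
    return best, candidates

def toy_score(params):
    """Placeholder score, NOT a real model evaluation: more trees score higher."""
    return params["n_estimators"] / 300

grid = grid_candidates(search_space)  # 6 * 3 * 2 = 36 combinations
best, tried = randomized_search(search_space, toy_score, n_iter=10)
```

In this toy setting Grid Search would evaluate all 36 combinations, while randomized search evaluates only 10, at the cost of possibly missing the best combination. In a real search, the score function would train and validate a model for each candidate.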

F. Performance system
System performance is measured using a confusion matrix. Because the data has more than two classes, the task is a multiclass classification. The confusion matrix for the multiclass case is shown in Figure 3.

RESULTS AND DISCUSSION

A. EDA Result
The initial data comprises 13,611 rows with 17 columns (16 attributes and one label).
For data types, 14 features are float, two features are integer, and the label is object.
Meanwhile, the duplicate check found 68 duplicated rows, and there were no missing values.
Next, the feature correlation analysis found six features with high correlation values, between 0.83 and 1.00.

B. Experiment Scenario
This research has four scenarios, as shown in Table 3. The scenarios are: Decision Tree with imbalanced classes, Decision Tree with balanced classes, Random Forest with imbalanced classes, and Random Forest with balanced classes. Decision Tree is used as a comparison because the base learner of Random Forest is a tree. The values used for the number of trees (n_estimators) are 50, 75, 100, 150, 200, and 300.

C. Result
The data in the testing scenario is split into two parts, training and testing, with a percentage of 70:30.

Table 4a shows the results of Decision Tree classification testing with imbalanced classes. The test results show an accuracy of 0.8915, an average precision of 0.9070, an average recall of 0.9081, and an average f1-score of 0.9075. Meanwhile, the weighted averages range from 0.8915 to 0.8918. The highest classification results were in the Bombay class, while the lowest were in the Sira class.

For Random Forest with imbalanced classes (Table 4b), the weighted average is approximately 0.9210. The highest classification results were in the Bombay class, while the lowest were in the Sira class.

Table 5a shows the results of Decision Tree classification testing with balanced classes. The test results show an accuracy of 0.9569, an average precision of 0.9569, an average recall of 0.9569, and an average f1-score of 0.9568. Meanwhile, the weighted averages range from 0.9568 to 0.9569.
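The results distinguish between average and weighted-average scores. The sketch below shows how both can be derived from a multiclass confusion matrix, assuming that "average" means the unweighted (macro) mean over classes and "weighted average" means the mean weighted by the number of true samples per class. The toy matrix and the function names are illustrative and are not results from this study.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": actual_k}
        )
    return metrics

def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[k][k] for k in range(len(matrix))) / total

def macro_average(metrics, key):
    """Unweighted mean over classes."""
    return sum(m[key] for m in metrics) / len(metrics)

def weighted_average(metrics, key):
    """Mean over classes, weighted by the number of true samples per class."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total

# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
matrix = [
    [50, 0, 0],
    [0, 40, 10],
    [0, 5, 95],
]
metrics = per_class_metrics(matrix)
acc = accuracy(matrix)
macro_recall = macro_average(metrics, "recall")
weighted_recall = weighted_average(metrics, "recall")
```

Under these definitions the weighted-average recall always equals the accuracy, which is consistent with the weighted averages reported here lying very close to the accuracy values.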
The highest classification results were in the Bombay class, while the lowest were in the Sira class. The confusion matrix in Figure 5 shows that as many as

D. Discussion
The proposed method uses balanced data with oversampling and classification using Decision Tree and Random Forest. The test results show that classification using Random Forest with balanced data achieves better results than Decision Tree. Random Forest classification with oversampling obtained an accuracy of 0.9658, while Decision Tree with oversampling reached 0.9569.
In addition, hyperparameter tuning with Randomized Search uses various values for the number of trees. Tuning allows all variations of the number of trees to be evaluated in a single search rather than tested individually by hand. The results of the Randomized Search show that the optimal number of trees is 300.
Initialize the number of trees:

Output results:
Best Parameter: {'n_estimators': 300}

In the final section, we compare the proposed method with previous research, which used the same dry-beans data from Koklu. The comparison results are shown in Table 6.

Figure 3. Multiclass confusion matrix

The data is split 70:30. After dropping duplicates, the total data is 13,543 rows: 9,480 for training and 4,063 for testing. The results of testing using Decision Tree and Random Forest with imbalanced classes are shown in Tables 4a and 4b and Figures 4a and 4b.

Figure 4 shows the confusion matrix from test results using Decision Tree and Random Forest with imbalanced classes.

Table 1. Number of data rows in each class

Table 2 shows the EDA results.

Table 4a. Testing Result of Decision Tree with Imbalanced Classes

Table 4b. Testing Result of Random Forest with Imbalanced Classes

Table 4b shows the results of the Random Forest classification with imbalanced classes. The testing accuracy reaches 0.9210, with an average precision of 0.9326, an average recall of 0.9308, and an average f1-score of 0.9317.

Table 5a. Testing Result of Decision Tree with Balanced Classes

Table 5b. Testing Result of Random Forest with Balanced Classes
The highest classification results were in the Bombay class, while the lowest were in the Sira class.

Table 6. Comparison with previous research

CONCLUSION
A classification system for beans has been created using the Decision Tree and Random Forest methods with oversampling to balance the classes. The Decision Tree testing results show accuracy, precision, recall, and f1-score of 0.9569. Meanwhile, the Random Forest test results show accuracy, precision, recall, and f1-score of 0.9658.
ACKNOWLEDGMENT
This research was funded by the Penelitian Mandiri, University of Trunojoyo Madura, National Collaborative Research Scheme 2023.