Classification of Land Suitability For Soybean Crops Using The Cart Method and Feature Selection Using an Algorithm ABC

ABSTRACT


INTRODUCTION
Soybeans hold a prominent position as a priority food crop, following rice and corn. They are considered one of the most crucial commodities, with consistently increasing domestic demand. According to the publication "Statistik Konsumsi Pangan Tahun 2020," the per capita consumption of soybean-containing food items exhibited an average annual growth rate of 8.21% [...] foraging behavior of bees [8]. By leveraging the ABC algorithm, the researchers were able to identify the most relevant attributes among the entire set of attributes, improving the overall effectiveness of the classification process.
In a study conducted by Indra Irawan et al. [9], the influence of the Artificial Bee Colony (ABC) algorithm on the performance of the CART (Classification and Regression Tree) classification method was investigated. The results demonstrated a notable performance improvement, with an achieved accuracy of 82%. This accuracy surpassed that of alternative methods, indicating the superiority of the combined ABC and CART approach. The two techniques were integrated after the data preparation stage: following data preparation, feature selection was performed with the ABC algorithm to identify the most relevant attributes for the classification task, so that classification used only the attributes chosen at the feature selection stage. Ultimately, the study successfully employed the CART method with feature selection using the ABC algorithm, improving classification performance.
Consequently, developing a classification system becomes imperative to address the challenge of determining land suitability for soybean cultivation. Such a system would classify land based on its suitability for soybean plants, thereby providing valuable recommendations for determining appropriate land use and effectively preserving land quality. Ultimately, the implementation of this classification system would contribute to the cultivation and productivity enhancement of soybean plants.

Figure 1. System Development Flowchart
Figure 1 shows the system development method used in this study, CRISP-DM (Cross-Industry Standard Process for Data Mining). Its stages are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, which are described below:

Business Understanding
The business understanding stage determines goals based on the business situation, so that the right problems and solutions are identified [10]. Two problems are identified in this study. First, land use is still not in accordance with the actual land potential, which can lead to soybeans being planted on unsuitable land and to declining land quality. Second, soybean cultivation is limited by a lack of soybean land, so soybean production and productivity have not been able to meet the needs of the community.
The proposed solution is a soybean land suitability classification system that uses the CART method with feature selection by the ABC algorithm to determine land suitable for soybean plants. The resulting classifications are expected to provide recommendations for determining and expanding soybean land use according to the actual land potential.

Data Understanding
Several things were done to understand the data: collecting initial data, describing and exploring data, and verifying data. The stages of data analysis are as follows:

Initial Data Collection
Data for this study were collected by several methods, namely:
a. Library Studies
The data collection process involved a comprehensive literature review, conducted to identify the criteria that underpin the evaluation of soybean land suitability. To obtain the necessary information, authoritative sources were consulted, including the book authored by Wahyunto et al. [11] and the book authored by Ritung et al. [12]. Furthermore, a book on soybeans authored by Aidah and the KBM Indonesia Publishing Team [13] was used to deepen the understanding of soybeans. Additionally, scientific journals pertaining to soybean land suitability and the methods employed in this study were examined to gather pertinent data.

Data Description and Exploration
Data description and exploration were conducted to analyze and understand the structure of the initially collected data, aiming to generate tables that adhere to predetermined criteria for soybean land suitability. At the initial data collection stage, 111 data points were acquired from Central Bengkulu, Kepahiang, Lebong, and Mukomuko Regencies. These data encompassed 33 attributes along with one target attribute.
Not all attributes corresponding to the criteria for soybean land suitability were used in this study. Several attributes require further laboratory testing, and data are absent for others, so these attributes were excluded. Location attributes, namely mapping units, sub-districts, districts, and provinces, were added, resulting in 20 attributes and one target attribute. Subsequently, the collected data was divided based on the sub-districts within each mapping unit found on the land map. This division yielded an additional 135 data points, increasing the total dataset from the initial 111 data points to 247.
Furthermore, data exploration was conducted to assess various aspects of the data, including data types, volume, and other relevant characteristics. This exploration was performed after combining the data obtained from Central Bengkulu, Kepahiang, Lebong, and Mukomuko Regencies into a unified dataset for further analysis.

Data Verification
Data verification re-checks the data processed at the data exploration stage. It follows data exploration and yields the data that will be used in the data mining process. In this study, the attributes used for feature selection with the Artificial Bee Colony (ABC) algorithm and for classification with the Classification And Regression Tree (CART) method are shown in Table 2. The attributes listed there include Base Saturation (%), and the Evaluation Labels attribute serves as the target attribute.

Data Preparation
The data preparation stage in this study includes data conversion, feature selection, data cleaning, and data splitting.

Data Conversion
At the data conversion stage, attributes with non-numeric values, namely drainage, texture, N Total, P2O5, K2O, and erosion hazard, were converted into numerical data represented by the numbers 0, 1, 2, and so on.
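The conversion of non-numeric attributes to integer codes can be illustrated with a minimal sketch. The function name `encode_categories` and the drainage values below are hypothetical, and the codes are assigned in order of first appearance, which is an assumption; the study does not state which category received which number.

```python
def encode_categories(values):
    """Map each distinct non-numeric value to an integer code 0, 1, 2, ..."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # Next unused code, assigned in order of first appearance (assumption).
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical drainage values, not taken from the study's dataset.
drainage = ["good", "moderate", "poor", "good", "moderate"]
encoded, mapping = encode_categories(drainage)
print(encoded)   # [0, 1, 2, 0, 1]
print(mapping)   # {'good': 0, 'moderate': 1, 'poor': 2}
```

Returning the mapping alongside the encoded column makes it possible to apply the same codes to new records later.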

Feature Selection
At the feature selection stage, features are selected from the existing attributes. For the mapping unit, sub-district, district (regency), and province attributes, the correlation with the evaluation label attribute was calculated. The values obtained are negative, which the study takes to mean that these attributes do not affect the evaluation label attribute. The value for the mapping unit attribute is -0.190709; for the sub-district, -0.221255; for the regency, -0.074974; and for the province attribute, NaN.
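As a hedged illustration of the correlation check, the sketch below computes a Pearson correlation coefficient in plain Python. The study does not state which correlation measure or library was used, so Pearson is an assumption, and all data values here are hypothetical. The sketch also shows one way a NaN can arise: the coefficient is undefined for an attribute that has the same value in every record, which would be the case for a province attribute if all records come from one province.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined: one of the attributes is constant
    return cov / math.sqrt(var_x * var_y)


# Hypothetical encoded columns, not the study's data.
labels = [0, 1, 2, 3, 1, 0]
sub_district = [5, 3, 4, 1, 2, 6]
province = [1, 1, 1, 1, 1, 1]  # a single province for every record

print(pearson(sub_district, labels))
print(pearson(province, labels))  # nan
```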

Data Cleaning
Missing values in the data are handled by replacing them with the value "Unknown", which avoids conflicts with the data types used. A subsequent check for missing values then finds no NaN or empty values.
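A minimal sketch of this cleaning step, assuming records are held as dictionaries and that a missing value appears as `None`, an empty string, or a floating-point NaN. The helper names and the sample records are hypothetical.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and float NaN as missing (assumption)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def fill_missing(rows, placeholder="Unknown"):
    """Return copies of the rows with missing values replaced by the placeholder."""
    return [
        {key: (placeholder if is_missing(value) else value)
         for key, value in row.items()}
        for row in rows
    ]


# Hypothetical records, not taken from the study's dataset.
records = [
    {"texture": "clay", "drainage": None},
    {"texture": "", "drainage": "good"},
    {"texture": "loam", "drainage": float("nan")},
]
cleaned = fill_missing(records)
remaining = sum(is_missing(v) for row in cleaned for v in row.values())
print(cleaned)
print(remaining)  # 0
```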

Split Data
In the data split stage, the data is divided into training and testing data: 70% training data (173 records) and 30% testing data (74 records).
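The split sizes can be checked with a short sketch. It assumes a shuffled split in which the training size is 70% of 247 rounded to the nearest integer; the study does not say how the split was implemented, and the function name, seed, and placeholder records are hypothetical.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Rounding to the nearest integer is an assumption about the split rule.
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(247))  # stand-ins for the 247 records
train, test = split_train_test(rows)
print(len(train), len(test))  # 173 74
```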

Modelling
The system flowchart, the flowchart for feature selection with the ABC algorithm, and the flowchart for classification with the CART method are as follows:
Figure 2. System Flowcharts
First, the input data consist of land location information (province, district/city, and sub-district) and a dataset of several attributes describing the requirements/characteristics of soybean land. Next, the data preparation stage cleans the data to handle missing or "Unknown" values and inconsistent data. The data is then split into 70% training data and 30% testing data. The training data can then either be classified directly using the CART (Classification And Regression Tree) method or first undergo feature selection using the ABC (Artificial Bee Colony) algorithm.
After feature selection, the best attributes are obtained and used in the classification process with CART. The resulting model takes the form of a decision tree and decision tree rules. The model is then applied to the training and testing data to make predictions, and the classification results fall into the classes S1 (highly suitable), S2 (moderately suitable), S3 (marginally suitable), and N (not suitable). Finally, an evaluation is carried out by computing the confusion matrix and the Sensitivity, Specificity, and Accuracy values.

Evaluation
At this stage, the model is evaluated to ensure that it meets the initial objectives. If it does, the process can proceed to the deployment stage; if not, it can return to the business understanding stage [10]. In this study, evaluation is based on the confusion matrix, from which the Sensitivity, Specificity, and Accuracy values are calculated.

Deployment
At this stage, reports and journal articles are prepared from the results of the research. UML (Unified Modeling Language) diagrams are also produced at this stage. UML is an industry-standard language for designing software systems; it provides models and model visualizations and is a modeling language independent of any programming language [14]. The UML models include use case diagrams and class diagrams [15]. In addition, interface design, interface creation, and system development were carried out based on the previously prepared designs.

RESULTS AND ANALYSIS

Artificial Bee Colony (ABC)
Feature selection chooses the features that affect accuracy the most. It can improve the performance of classification methods by removing irrelevant features [16]. Feature selection in this study uses the Artificial Bee Colony (ABC) algorithm, which was introduced by Karaboga and Basturk [17]. The steps in the ABC optimization process are as follows:
1. Initialization phase. The positions of the food sources are determined randomly using equation (1).
2. Employed bee phase. Each employed bee explores the neighborhood of its associated food source using equation (2). After v_ij is generated, its fitness value is computed using equation (3). After searching, the employed bees share information about the food sources with the onlooker bees. The probability value given by equation (4) helps an onlooker bee choose the food source to explore next.
3. Onlooker bee phase. The onlooker bee uses roulette wheel (RW) selection, driven by the probability values, to choose a preferred food source and then explores its neighborhood as described in the employed bee phase.
4. Scout bee phase. The scout bee decides whether to renew a food source based on the LIMIT variable, which is updated on each exploration. If LIMIT > MAX LIMIT, the scout bee replaces the food source with a new random one; if LIMIT < MAX LIMIT, the food source is not renewed [18].

The ABC algorithm was run with several iteration counts, and three trials were carried out for each count. The best attributes are those ranked in the top 8 by number of occurrences; if an attribute further down the ranking has the same count as rank 8, it is also taken as a best attribute. Table 3 displays the feature selection results of the ABC algorithm with ten iterations and three trials. Ten iterations are reported because they produced the highest accuracy, 97.11%, compared with the other iteration counts. The dataset has 16 attributes subjected to feature selection with the ABC algorithm, which yields the attributes that appear most frequently across iterations. The occurrences of each attribute are summed and ranked to find the most influential attributes, namely the top 8 highlighted in green in Table 3, which are used in the ABC+CART method.
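The four ABC phases listed earlier in this section can be sketched in code. This is a generic, continuous-valued ABC that minimizes a toy objective, not the study's feature selection implementation, which would additionally need to encode attribute subsets and score them with a classifier. Equations (1) to (4) are not reproduced in this text, so the update rules in the comments are the standard forms from the ABC literature and are assumed to correspond to them. The function name, parameter names, and the sphere objective are hypothetical.

```python
import random


def abc_minimize(cost, bounds, n_sources=10, max_limit=20, iterations=50, seed=0):
    """Generic ABC sketch; returns (best_position, best_cost, initial_best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_source():
        # Assumed form of equation (1): x_ij = min_j + rand(0, 1) * (max_j - min_j)
        return [lo + rng.random() * (hi - lo) for lo, hi in bounds]

    def fitness(c):
        # Assumed form of equation (3): lower cost gives higher fitness
        return 1.0 / (1.0 + c) if c >= 0 else 1.0 + abs(c)

    sources = [random_source() for _ in range(n_sources)]
    costs = [cost(s) for s in sources]
    trials = [0] * n_sources  # per-source LIMIT counters
    best_cost = min(costs)
    best = list(sources[costs.index(best_cost)])
    initial_best_cost = best_cost

    def explore(i):
        nonlocal best, best_cost
        # Assumed form of equation (2): v_ij = x_ij + phi * (x_ij - x_kj), k != i
        k = rng.choice([s for s in range(n_sources) if s != i])
        j = rng.randrange(dim)
        candidate = list(sources[i])
        candidate[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[k][j])
        lo, hi = bounds[j]
        candidate[j] = min(max(candidate[j], lo), hi)  # keep inside the bounds
        c = cost(candidate)
        if fitness(c) > fitness(costs[i]):  # greedy selection
            sources[i], costs[i], trials[i] = candidate, c, 0
            if c < best_cost:
                best, best_cost = list(candidate), c
        else:
            trials[i] += 1

    for _ in range(iterations):
        for i in range(n_sources):  # employed bee phase
            explore(i)
        # Assumed form of equation (4): p_i = fit_i / sum(fit), used as roulette weights
        fits = [fitness(c) for c in costs]
        for _ in range(n_sources):  # onlooker bee phase
            explore(rng.choices(range(n_sources), weights=fits, k=1)[0])
        for i in range(n_sources):  # scout bee phase
            if trials[i] > max_limit:
                sources[i] = random_source()
                costs[i] = cost(sources[i])
                trials[i] = 0
                if costs[i] < best_cost:
                    best, best_cost = list(sources[i]), costs[i]
    return best, best_cost, initial_best_cost


def sphere(x):
    """Toy objective (hypothetical): sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


best, best_cost, initial = abc_minimize(sphere, [(-5.0, 5.0)] * 3)
print(best_cost, initial)
```

Because the best solution is memorized across all phases, the returned cost can never be worse than the best cost in the initial population.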

Classification And Regression Tree (CART)
The decision tree technique known as CART (Classification And Regression Tree) aims to group data and to describe the dependent variable in terms of the independent variables [5]. The CART algorithm has several advantages:
1. Because it is a non-parametric method, the CART algorithm does not require the specification of any functional form.
2. Variables do not need to be selected in advance; the CART algorithm identifies the most significant variables and eliminates the insignificant ones.
3. The CART algorithm easily handles outliers, that is, data with different characteristics that appear as extreme values in a single variable or a combination of variables.
4. The CART algorithm is computationally fast and makes no assumptions.
5. The CART algorithm is flexible and can be adjusted as needed.
The CART algorithm also has several drawbacks:
1. The CART algorithm can generate an unstable decision tree.
2. The CART algorithm splits the data on only one variable at a time [19].
IT Jou Res and Dev, Vol. 9, No. 1, August 2024: 1-13

According to Steinberg [9], a classification tree is formed with the CART algorithm in the following stages:
1. Define the root.
2. Calculate the Gini index for each prospective branch.
3. Choose as the splitting attribute the candidate that reduces impurity the most, that is, the one with the smallest Gini index after the split.
4. Repeat steps 2 and 3 until the leaves are close to pure.
The attribute selection measure is crucial in determining the most suitable criterion for partitioning the training data. It is a heuristic for selecting the criterion that partitions the data most effectively, resulting in close-to-pure divisions. The attribute selection measure ranks the attributes of the training data, and the attribute with the highest rank is selected as the splitting attribute. If the splitting attribute is continuous-valued, a split point is determined. If the splitting attribute is discrete, the split must still be binary.
The Gini index serves as the attribute selection measure within the CART (Classification and Regression Tree) algorithm. It considers a binary split for each attribute. The Gini index quantifies the impurity of a data partition, denoted D, using formula (5).
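Formula (5) is not reproduced in this text. As a hedged reference, the standard definition of the Gini index, which is assumed to be the one intended, is given below. Here m is the number of classes and p_i is the proportion of records in D that belong to class i; these symbols are introduced for illustration.

```latex
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^{2}
```

Under this definition, a partition containing only one class has a Gini index of 0, and a partition split evenly between two classes has a Gini index of 1 - (0.5^2 + 0.5^2) = 0.5.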
The impurities of the resulting partitions are combined to evaluate a binary split. Formula (6) gives the Gini index of D for a binary split on attribute A, as a weighted sum of the impurities of the two resulting partitions. For a discrete-valued attribute, the subset that gives the smallest Gini index (closest to 0) for that attribute is chosen as its splitting subset. For continuous-valued attributes, each possible split point must be checked: the attribute values are sorted, and the midpoint between each pair of adjacent values is taken as a candidate split point. The candidate with the smallest Gini index becomes the split point for that attribute. The reduction in impurity resulting from a binary split on attribute A is given by:
$$\Delta \mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \tag{7}$$
The splitting attribute is the one that reduces impurity the most. This attribute, together with its splitting subset (for a discrete-valued attribute) or its split point (for a continuous-valued attribute), forms the splitting criterion [20].
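The split-point search for a continuous-valued attribute can be sketched as follows. The sketch assumes the standard Gini definition and a size-weighted sum of the two partition impurities for the split; the function names, the pH values, and the labels are hypothetical and do not come from the study's dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, gini_after_split, gini_reduction) for one numeric attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = None
    for i in range(1, n):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # no boundary between identical values
        split = (left_value + right_value) / 2  # midpoint of adjacent values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Size-weighted impurity of the two partitions
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or weighted < best[1]:
            best = (split, weighted, parent - weighted)
    return best


# Hypothetical soil pH readings and suitability classes.
ph = [4.5, 5.0, 5.8, 6.2, 6.5, 7.0]
suitability = ["N", "N", "S3", "S1", "S1", "S1"]
split_point, gini_after, reduction = best_split(ph, suitability)
print(split_point, gini_after, reduction)
```

For these sample values the search selects the midpoint between 5.8 and 6.2, because that split leaves one partition pure and gives the largest reduction in impurity.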
In this study, the CART method produces a decision tree and decision tree rules that are used as the model for classifying the testing data. The decision tree is shown in Figure 3.

ABC+CART
This study also combined the CART method with the ABC algorithm. For the combined method, experiments were carried out with the attributes obtained at each iteration count used in this study, namely 5, 10, 25, 50, 75, and 100 iterations. The resulting decision tree is as follows:


Confusion Matrix
In the testing stage, the performance of the classification method is evaluated by comparing the original labels of the data with the labels predicted by the model. This evaluation uses a confusion matrix, a data mining concept from which Sensitivity, Specificity, and Accuracy are derived. Accuracy is the percentage of correctly classified data, Specificity is the ratio of correctly classified negative data to the total negative data, and Sensitivity is the ratio of correctly classified positive data to the total positive data [19]. They are calculated with equations (8), (9), and (10), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:
1. Sensitivity [21]
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{8}$$
2. Specificity [21]
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{9}$$
3. Accuracy [21]
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
The evaluation of the machine learning model in this study involved a train/test split. The data was divided into training data, comprising 70% of the total data (173 instances), and testing data, comprising 30% of the total data (74 instances). After the split, the model was applied, and the labels predicted by the model were compared with the original labels. Two methods were employed in this study: the classification and regression tree (CART) method and a combination of the CART method with the artificial bee colony (ABC) algorithm. The confusion matrix was then calculated for both methods, and the results were compared as shown in Table 5.
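A minimal sketch of these metrics for a multi-class problem is given below. It assumes a one-vs-rest treatment, in which one class at a time is taken as positive and all others as negative; the study does not state how the four classes were handled. The function names and the sample labels are hypothetical.

```python
def one_vs_rest_metrics(actual, predicted, positive):
    """Sensitivity, specificity, and accuracy with one class treated as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def overall_accuracy(actual, predicted):
    """Share of records whose predicted label equals the original label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels, not the study's test results.
actual = ["S1", "S2", "S3", "N", "S2", "S1", "N", "S3"]
predicted = ["S1", "S2", "S2", "N", "S2", "S3", "N", "S3"]

metrics_s2 = one_vs_rest_metrics(actual, predicted, "S2")
print(metrics_s2)
print(overall_accuracy(actual, predicted))  # 0.75
```

Note that the one-vs-rest accuracy for a single class differs from the overall multi-class accuracy, so it is worth stating which of the two a reported figure refers to.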

Figure 3. CART Decision Tree
The decision tree rules generated from the CART method are shown in the following table:

Table 2. Attributes in the Implementation of Feature Selection and Classification

Table 3. Number of Attribute Appearances over 10 Iterations