TUBERCULOSIS CLASSIFICATION USING RANDOM FOREST WITH K-PROTOTYPE AS A METHOD TO OVERCOME MISSING VALUE

Abstract: Tuberculosis is a disease that attacks the respiratory organs and affects many people. It is one of the contributors to high mortality, especially in Indonesia. Based on its anatomical location, tuberculosis is divided into two classes: pulmonary, for tuberculosis detected in the lung parenchymal tissue, and extrapulmonary, for tuberculosis detected in organs other than the lungs. Detecting the location of the infection requires analysis of laboratory results for the triggering parameters. This analysis is still performed manually, so it takes longer, and because patient data is also entered manually, the possibility of human error is very high. Therefore, the solution offered, and the aim of this study, is to ease patient diagnosis in determining the classification of TB disease. The method used in this study is K-Prototype imputation to repair missing values in data with mixed types; the tuberculosis data is then classified using the Random Forest, Support Vector Machine, and Backpropagation methods.


INTRODUCTION
Every year, the development of an increasingly sophisticated era is marked by growing industrial technology in various fields, resulting in the rapid growth of factories. However, it is undeniable that the rapid growth of factories in Indonesia has a negative impact: the air pollution produced during production lowers air quality so that it is no longer suitable for the health of the respiratory organs, namely the lungs. Bacteria or viruses infect unhealthy lungs more easily; one example of a disease caused by a bacterial infection is tuberculosis. Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis in the lungs and is one of the diseases with the highest death rate [1]-[4]. Based on Indonesian TB data in 2022, deaths from tuberculosis number 93,000 people per year, equivalent to 11 deaths per hour [5]. Even though the number of deaths caused by TB has decreased from the previous year, it must still be watched closely precisely because the disease is contagious.
According to data from the World Health Organization (2019), Indonesia is listed as one of the countries with the third highest level of TB cases, with a total of 842,000, or 46% of all cases; East Java province in particular ranks second in Indonesia in TB cases, with a total of 57,014 [6]. TB cases have increased from year to year in various regions due to a lack of socialization about the dangers of TB and how to prevent and treat it. This can be seen from how people underestimate their health and do not complete treatment for TB disease, which increases cases in Indonesia. Therefore, it is necessary to utilize data mining techniques to build a classification system that can facilitate diagnosing TB disease so that treatment can be carried out immediately.

TUBERCULOSIS CLASSIFICATION
In data mining, structured and complete data is needed for accurate results. However, not all raw data can be processed immediately, because it may contain noise or empty values, called missing values. In general, datasets in the medical field are incomplete [7]. Missing values can reduce the accuracy of the data [8], [9]. Therefore, an approach is needed to overcome the problem of missing values. Missing values can be handled by ignoring them during the analysis, or by performing imputation. Imputation is a technique of replacing missing values with values obtained from a method [10].
One of the technologies currently popular is machine learning. In machine learning, classification methods play a role in various fields, one of which is the health sector, where machine learning can predict disease and present medical diagnosis data. Many machine learning methods are used to classify and analyze diseases, one of which is Random Forest. In previous research on Coronary Artery Disease classification, which compared several machine learning methods, the best accuracy results were obtained when applying the Support Vector Machines, K-Nearest Neighbors, Neural Network, and Random Forest algorithms [11]. Therefore, this research classifies data using three different methods, namely Random Forest, Support Vector Machine, and Backpropagation, to find out which method performs best at categorizing data involving an imputation process using the K-Prototype method.

PRELIMINARIES
System errors, such as a sensor or input device failing to respond, as well as human errors when entering data, are common occurrences that leave data incomplete and cause missing data. In data mining, some methods can only be applied when the data has complete features, so special handling is needed for missing data. The following methods are used to deal with missing data problems: case deletion, parameter estimation, and imputation techniques.
Case deletion is the simplest method: data containing missing values is deleted so that it is not used in further processing. However, this method has a weakness, because some important information is deleted along with it. The imputation technique is widely used because no data is deleted needlessly. This technique estimates missing data by obtaining patterns from data that have complete features. Mean, median, and clustering are some of the most frequently used imputation methods [12].
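The mean and mode imputation strategies mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the function names and sample columns are assumptions.

```python
# Sketch of two simple imputation strategies: mean for a numeric column,
# mode (most frequent value) for a categorical column. `None` marks a
# missing value. Names like `impute_mean` are illustrative, not from the paper.

def impute_mean(values):
    """Replace None entries in a numeric column with the column mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_mode(values):
    """Replace None entries in a categorical column with the most frequent value."""
    observed = [v for v in values if v is not None]
    mode = max(set(observed), key=observed.count)
    return [mode if v is None else v for v in values]

ages = [34, None, 51, 29, None, 46]
sexes = ["M", "F", None, "F", "F", "M"]
print(impute_mean(ages))   # [34, 40.0, 51, 29, 40.0, 46]
print(impute_mode(sexes))  # ['M', 'F', 'F', 'F', 'F', 'M']
```

Clustering-based imputation, as used later in this study with K-Prototype, replaces these simple column statistics with statistics of the cluster the record belongs to.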
Data mining can be understood as the process of mining information from data, or disclosing information from recorded data sets [13]. Data mining is characterized as a process of extracting data in which users interact with various reports using analytical tools as part of information mining.
The purpose of data mining is to obtain valuable information from a set of records, so the information sources used in data mining are varied data whose arrangement is unstructured or perhaps semi-structured. Based on the tasks that can be performed, data mining is grouped into several categories, namely description, prediction, estimation, classification, clustering, and association [14]. The stages in data mining are shown in Figure 1.

Data Preprocessing
The data preprocessing process in this study begins with data transformation to change the categorical data type to numeric so that computation can be carried out. Then, next is the data imputation process to overcome the missing value condition. The final data preprocessing stage is normalization for scaling the range of values between data.

A. Data Transformation
The data transformation process is the stage of changing one data type into another data type [15]. Data transformation aims to simplify the classification process. In this study, the transformation is carried out by converting the categorical values in the dataset into numeric data.

1. A missing value is a condition in which several attribute values in the data are empty or missing. There are several causes of missing values, one of which is an error during the data entry process [16]. Therefore, it is necessary to apply an algorithmic method to overcome missing values, one of which is the imputation process.
2. Imputation is a process that handles missing values in datasets by filling in missing data with new values based on records that have complete attributes or other information available in the dataset [17]. This imputation process can handle missing values better than other methods. Several algorithms can be applied in the imputation process, one of which is the K-Prototype.
3. K-Prototype is an algorithm commonly used to group data with mixed data types, namely numeric and categorical [18]. This method combines the distance calculation of the k-means algorithm, namely the Euclidean distance, with the dissimilarity measure found in k-modes [19]. The steps for imputing using the K-Prototype algorithm are as follows:
   a. Determine the number of clusters to be used.
   b. Select the initial cluster center (centroid) values.
   c. Calculate the distance of each data point to each selected center value using the following formula:
      d(x_i, q_l) = Σ_{r=1}^{p} (x_ir − q_lr)² + γ Σ_{r=p+1}^{m} δ(x_ir, q_lr)
      where the first term is the distance measure for the p numeric attributes, the second term is the distance measure for the categorical attributes, q_lr is the attribute value at the center of the cluster (centroid), x_ir is the attribute value of the record whose distance is sought, γ is a weight for the categorical attributes, and δ(a, b) = 0 if a = b and 1 otherwise.
   d. Group the data based on the smallest distance.
   e. Determine the new centroid value using the average of the numeric attributes and the mode of the categorical attributes of the members of each cluster.
   f. Repeat steps c to e until no member of any cluster moves.
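The steps above can be sketched as a minimal pure-Python K-Prototype pass. This is an illustrative sketch under simplifying assumptions (fixed iteration count, first records as initial centroids), not the implementation used in this study; all names are invented for the example.

```python
# Minimal K-Prototype sketch for mixed-type records: squared Euclidean
# distance on numeric attributes plus a gamma-weighted mismatch count on
# categorical attributes, with mean/mode centroid updates.
from collections import Counter

def kproto_distance(x, q, n_numeric, gamma=1.0):
    """Mixed distance: numeric part + gamma * categorical mismatches."""
    num = sum((x[r] - q[r]) ** 2 for r in range(n_numeric))
    cat = sum(1 for r in range(n_numeric, len(x)) if x[r] != q[r])
    return num + gamma * cat

def kprototypes(data, k, n_numeric, gamma=1.0, iters=10):
    # step b: take the first k records as initial centroids (a simplification)
    centroids = [list(data[i]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # steps c-d: assign each record to the nearest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k),
                    key=lambda c: kproto_distance(x, centroids[c], n_numeric, gamma))
            clusters[j].append(x)
        # step e: new centroid = mean of numeric, mode of categorical attributes
        for c, members in enumerate(clusters):
            if not members:
                continue
            for r in range(len(data[0])):
                col = [m[r] for m in members]
                if r < n_numeric:
                    centroids[c][r] = sum(col) / len(col)
                else:
                    centroids[c][r] = Counter(col).most_common(1)[0][0]
    return centroids, clusters

# Toy records: (age, sex) with one numeric and one categorical attribute.
data = [(20, "M"), (22, "M"), (60, "F"), (62, "F")]
centroids, clusters = kprototypes(data, k=2, n_numeric=1)
print(centroids)  # [[21.0, 'M'], [61.0, 'F']]
```

For imputation, a record with a missing attribute would then receive the corresponding centroid value (mean for numeric, mode for categorical) of its nearest cluster.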

Normalization
Data normalization is a scaling technique that aims to bring the values in the data into the same range [20]. With normalization, the dataset takes on a new range of values, so that no value is too large or too small, which simplifies the statistical analysis process.
Min-Max Normalization is one method commonly applied in the data normalization process [21]. The formula for min-max normalization is:

x' = (x − min(x)) / (max(x) − min(x))

where x' is the normalization result, x is the actual value of the variable, min(x) is the minimum value, and max(x) is the maximum value.
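The min-max formula above amounts to a one-line transformation per column; the sketch below is illustrative, with an assumed sample column.

```python
# Min-max normalization: rescale a numeric column to the [0, 1] range
# using x' = (x - min(x)) / (max(x) - min(x)).
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([20, 30, 40, 60]))  # [0.0, 0.25, 0.5, 1.0]
```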

Data Mining Process
The data mining process involves dividing the data into training data and test data before the classification process. Training data is needed to train the classification model, while test data is used to evaluate the model obtained during training. The data splitting is carried out using one of several methods, one of which is k-Fold Cross Validation. This method uses the value of k to determine the number of partitions. Below is an illustration of the use of the value k = 5 [21].
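The k-fold splitting described above can be sketched in pure Python. This is an illustrative version (contiguous folds, no shuffling), not the cross-validation routine used in the study.

```python
# k-Fold Cross Validation index splitting: partition n_samples indices into
# k folds; each fold serves once as the test set while the rest are training.
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder
        end = (i + 1) * fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # 8 2 on each of the 5 folds
```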

A. Random Forest
Random Forest is a method that builds a model using a collection of decision trees. The method is a supervised learning algorithm that classifies data based on samples and attributes of the training data. Random Forest is also one of the algorithms that uses ensemble techniques, applying bagging and random feature selection. The ensemble learning applied in this method helps reduce unstable classifications by combining several base learners to reduce prediction errors [22]. The stages of the Random Forest algorithm are as follows.
1. The first stage is inputting the dataset and then bootstrapping the data to create subsets by taking random samples of size n, with replacement, from the training data.
2. The second stage is the process of random feature selection to build a tree until it reaches the maximum size.
3. The third stage is to calculate the value of all features for building a tree using the entropy formula, E = −Σ_i p_i log₂(p_i), where p_i is the proportion of class i in the node.
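The entropy measure used for evaluating features when building a tree can be computed as in the standard Shannon form; the sketch below is illustrative, with the paper's two TB classes as an assumed example.

```python
# Shannon entropy of a list of class labels: E = -sum(p_i * log2(p_i)).
# A 50/50 split gives entropy 1.0; a pure node gives 0.
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

mixed = ["pulmonary", "pulmonary", "extrapulmonary", "extrapulmonary"]
print(entropy(mixed))                     # 1.0
print(entropy(["pulmonary"] * 4) == 0)    # True for a pure node
```

A split is preferred when it reduces entropy the most (the information gain) across the resulting child nodes.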

B. Support Vector Machine
Support Vector Machine was invented by Vapnik in 1992 as a machine learning strategy that works with Structural Risk Minimization (SRM). Support Vector Machine aims to determine the best hyperplane that separates two classes in the input space. A hyperplane is a separator between two classes that aims to maximize the distance (margin) between the data classes [23], [24]. To find the optimal separating function (classifier) that can separate two different classes, the best hyperplane must be found among the unlimited number of possible hyperplanes. A hyperplane is good if it lies right between the two sets of objects from the two classes. Figure 3 shows how SVM maximizes the distance (margin) between two different classes by determining the best hyperplane.
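The role of the hyperplane can be made concrete: a trained linear SVM classifies by the sign of w·x + b, and the margin it maximizes equals 2/||w||. The weights below are illustrative assumptions, not a trained model.

```python
# How a linear SVM hyperplane separates two classes: classification is
# sign(w·x + b), and the geometric margin being maximized is 2 / ||w||.
import math

def classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin(w):
    return 2 / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, -1.0], 0.0        # hypothetical hyperplane x1 - x2 = 0
print(classify(w, b, [3, 1]))  # 1  (on the positive side)
print(classify(w, b, [1, 3]))  # -1
print(margin(w))               # 2/sqrt(2) ≈ 1.414
```

With a non-linear kernel such as the RBF kernel used later in this study, the same decision rule is applied in an implicit feature space rather than directly on w.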

D. Backpropagation
Backpropagation is a supervised learning algorithm that uses several layers to update or change the weight values connected to several neurons in the hidden layer [26]. This algorithm works by minimizing the error value in the network's output results by updating the weight value in the backward phase using the output error value. The output error value is obtained after performing calculations in the forward phase. In this algorithm, the learning process is carried out in two phases, namely the forward propagation and backward propagation stages. The following is the flow of the backpropagation algorithm in each phase.

Phase 1: Forward
1. Initialize the weights using small random values, max Epoch, error, and learning rate.
2. If the epoch value is smaller than the maximum value, do steps 3 to 4. This condition repeats until the requirement is met.

3. Each unit in the input layer receives an input signal and forwards it to the hidden layer.
4. Calculate all outputs in the hidden layer (j = 1, 2, …, p) using the following formula:

   z_j = f(v_0j + Σ_{i=1}^{n} x_i v_ij)

Phase 2: Backward
Calculate the change in weights using the formulas below:

   Δw_jk = α δ_k z_j ;  k = 1, 2, …, m ;  j = 0, 1, …, p
   Δv_ij = α δ_j x_i ;  j = 1, 2, …, p ;  i = 0, 1, …, n

Description: Δw_jk : change in the weight values to the output layer; Δv_ij : change in the weight values from the input layer.
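The forward pass and the Δw = α·δ·z weight-update rule can be sketched for a tiny 2-1-1 network. This is an illustrative example under assumptions (sigmoid activation, arbitrary starting weights), not the network configuration used in this study.

```python
# One backpropagation step for a 2-input, 1-hidden-unit, 1-output network:
# forward pass, then weight changes Δw_jk = α·δ_k·z_j and Δv_ij = α·δ_j·x_i.
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def train_step(x, target, v, w, alpha=0.5):
    """x: inputs; v: input->hidden weights [bias, v1, v2]; w: hidden->output [bias, w1]."""
    # forward phase
    z = sigmoid(v[0] + v[1] * x[0] + v[2] * x[1])  # hidden-unit output
    y = sigmoid(w[0] + w[1] * z)                   # network output
    # backward phase: error terms computed with the current weights
    delta_k = (target - y) * y * (1 - y)
    delta_j = delta_k * w[1] * z * (1 - z)
    # weight updates (bias weights use an implicit input of 1)
    w[0] += alpha * delta_k
    w[1] += alpha * delta_k * z
    v[0] += alpha * delta_j
    v[1] += alpha * delta_j * x[0]
    v[2] += alpha * delta_j * x[1]
    return (target - y) ** 2  # squared error before the update

v, w = [0.1, 0.2, -0.1], [0.0, 0.3]
errors = [train_step([1.0, 0.0], 1.0, v, w) for _ in range(50)]
print(errors[0] > errors[-1])  # True: the error decreases over the epochs
```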

Evaluation
The system performance evaluation process is needed to measure how well the method used is.
The technique commonly used in evaluating the system is the confusion matrix. The confusion matrix can produce calculations of the level of accuracy, precision results, recall, and f1 score [27].
Table 1 shows a representation of the confusion matrix. From the confusion matrix results, the AUC value, accuracy, recall, precision, and f1 score are calculated using the following formulas.
1. AUC (area under the curve) is a measure of the area under the ROC curve. The higher the AUC value, the better the classification method performs in a study [27]. The following is a table of classification quality based on AUC values.
2. Accuracy is the percentage of predicted classification results that are correct compared to all input data:
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
3. Recall is the number of records correctly classified as positive divided by the number of positive records in the dataset:
   Recall = TP / (TP + FN)
4. Precision is the number of records correctly classified as positive divided by all records predicted as positive:
   Precision = TP / (TP + FP)
5. F-measure (f1 score) is the harmonic mean of recall and precision, with a range of 0 to 1:
   F1 = 2 × (Precision × Recall) / (Precision + Recall)
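The metric formulas above can be computed directly from the four confusion-matrix counts; the counts used below are illustrative, not results from this study.

```python
# Accuracy, recall, precision, and f1 score from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

acc, rec, prec, f1 = metrics(tp=80, tn=15, fp=3, fn=2)
print(round(acc, 3), round(rec, 3), round(prec, 3), round(f1, 3))
# 0.95 0.976 0.964 0.97
```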

A. Data collection
In this study, the dataset used was information on patients diagnosed with tuberculosis: 985 records with six attributes, namely age, sex, chest X-ray, HIV status, history of diabetes, and TCM results. The raw data still contains several empty values, or so-called missing values, which are later resolved by applying the imputation technique using the K-Prototype method. The missing values in the dataset can be seen in Table 3. The IPO diagram is described as follows:

Input
The input section is the initial part, namely the data input process. The data is taken from the medical records of patients with tuberculosis and consists of 7 attributes.

Preprocessing
This stage aims to change the raw data to be more structured to facilitate and increase the accuracy in classifying data. In this study, there were several stages of data preprocessing, including the following:

a. Transformation
Data transformation is the initial stage in the data preprocessing carried out in this study.
This stage is necessary to convert categorical data into numeric data so that it can be computed.

b. Imputation
After transforming the data, the next step is to perform the imputation technique to handle missing values using the K-Prototype algorithm.

c. Normalization
The new dataset obtained from the imputation process is normalized before entering the classification process. This is necessary so that the range of values in the data is not too wide; the values are scaled to a 0 to 1 range.

Data Sharing Process
The process of dividing the data in this study applies the K-fold Cross Validation method with k = 5 and k = 10.

Classification
In the classification process, learning is carried out to obtain a classification model using several different methods, namely Support Vector Machine, Backpropagation, and Random Forest. The classification models are then compared on their accuracy results to obtain the optimal classification result.

Outputs
The resulting output is a class prediction of TB attributes based on the modeling method proposed in this study.

A. Imputation
Data preprocessing is crucial before data is processed in Machine Learning (ML).
Data preprocessing is used so the system can properly process the data. This stage includes filling in the missing values, because the presence of missing values can affect the results of the dataset classification itself. The preprocessing carried out in this study was imputation, or filling in the missing values, using the K-Prototype imputation method. Before that, the categorical data is encoded and the numeric data in the 'Age', 'Gender', and 'TCM Results' columns is normalized.

Categorical encoding can be done with label encoding or one-hot encoding. In practice, using label encoding for categorical data with more than two distinct values can be misinterpreted by the algorithm as implying a hierarchy or ordering. One-hot encoding is a technique that turns each value in a column into a new column and fills it with a binary value, namely 0 or 1. An example of the data after a one-hot encoding process:

The number of clusters in this study is K = 3, and the distance between each record and each centroid is computed using the K-Prototype formula, because two types of data, categorical and numeric, are involved in calculating the distances. Data are grouped according to the minimum distance from the centroid, and iterations are performed to determine the new centroids until the centroids and the cluster memberships no longer change.

B. Random Forest
The classification process using the Random Forest method involves dividing the dataset with the K-Fold Cross Validation method, using the values k=5 and k=10. The number of trees used is 10. The following are the results of the evaluation using the confusion matrix on the tuberculosis dataset of 985 records. Table 4 shows the test results using the two different k-fold values. The two accuracy results are not much different, with k-fold = 5 slightly better than k-fold = 10: the classification accuracy reaches 98.6% with k-fold = 5 and 97.4% with k-fold = 10. However, the AUC value with k-fold = 10 is better, at 98.8%.

C. Support Vector Machine
SVM modeling with the RBF kernel was tested using the k-fold method to divide training data and test data, with k-fold values of 5 and 10. The C parameter value used is 1, with an error tolerance of 0.0010 and a maximum of 100 iterations. The evaluation results of the SVM modeling with the RBF kernel are shown in the table below. Table 5 shows the test results using the two different k-fold values, where the two accuracy results have the same value: the classification accuracy for both k-fold = 5 and k-fold = 10 reaches 97.4%. However, the AUC value with k-fold = 5 is superior to k-fold = 10, at 95.3%.

D. Backpropagation
Testing of the Backpropagation modeling begins by dividing the dataset into test data and training data using the k-fold method with values of k = 5 and 10. The hidden layer uses 100 neurons, the activation function is ReLU with the SGD and Adam optimizers, and the learning rate is 0.0001 with 200 iterations (epochs). The evaluation results of the Backpropagation modeling are shown in Table 6. Figure 5 shows the ROC graphs comparing the AUC values of the Random Forest, SVM, and Backpropagation methods. Based on the ROC graphs for both k-fold 5 and 10, modeling with the Random Forest method has the best accuracy, indicated by the dotted line representing the Random Forest AUC value being closest to 1; the closer the AUC value is to 1, the better the method's accuracy, consistent with Table 4. So, in this case, the Random Forest method has the best accuracy in classifying the TB dataset.

CONCLUSION
Based on the analysis and discussion of the classification evaluation results, dividing training and testing data using 5-fold and 10-fold on 985 records with seven attributes, namely age, sex, district, chest X-ray, HIV status, history of diabetes, and TCM (Rapid Molecular Test) results, it is concluded that: 1. K-Prototype imputation with K=3 can overcome gaps in data with mixed data types.
2. Using k-fold values of 5 and 10 does not give a significant difference, as shown in Tables 2 and 3. However, from the evaluation results, the k-fold value of 5 is slightly superior to k-fold 10 when implemented in this research.
3. Applying different classification methods gives different evaluation results. Based on the performance evaluation results for each method, the Random Forest method with a k-fold value of 5 has a better performance value than the Support Vector Machine and Backpropagation methods.

ACKNOWLEDGMENT
The author would like to thank Universitas Airlangga, Indonesia, which has facilitated the author's doctoral education, as well as the University of Trunojoyo Madura, Indonesia, where the author serves as a lecturer.