Thai Water Buffalo Disease Analysis with the Application of Feature Selection Technique and Multi-Layer Perceptron Neural Network

This research aims to develop an analysis model for water buffalo diseases by applying feature selection techniques together with the Multi-Layer Perceptron Neural Network (MLP-NN). The data used for analysis were collected from books and documents related to water buffalo diseases and from the official website of the Department of Livestock Development. The data describe the characteristics of six water buffalo diseases: anthrax, hemorrhagic septicemia, brucellosis, foot and mouth disease, parasitic diseases, and mastitis. Since the amount of collected data was limited, the Synthetic Minority Over-sampling Technique was also employed to adjust the imbalanced dataset. Two feature selection techniques, correlation-based feature selection and information gain, were then applied to the adjusted dataset to select the disease characteristics. Subsequently, the selected features were used to develop the analysis model for water buffalo diseases with the MLP-NN. The evaluation results given by 10-fold cross-validation showed that the model developed with correlation-based feature selection and the MLP-NN was the most effective, with an accuracy of 99.71%, a precision of 99.70%, and a recall of 99.72%, implying that the analysis model is effectively applicable.

Keywords-water buffalo diseases; feature selection; multi-layer perceptron; neural network; synthetic minority over-sampling


I. INTRODUCTION
In Thailand, water buffaloes play an important role in the livestock economy. Water buffaloes are among the most common farm animals in Asia, and farmers typically use them for agricultural labor and as a source of food. To raise water buffaloes productively, it is vital to pay attention to their nutrition, habitats, sanitation, signs, behavior, and disease symptoms [1]. At present, water buffalo farmers are confronting many kinds of water buffalo diseases due to seasonal changes, disease carriers, the lack of expertise and knowledge among the farmers themselves, and the lack of experts who can analyze and diagnose these diseases. Moreover, the internet only offers basic information about the diseases gained from statistical surveys, so farmers retrieve incorrect or inadequate data for disease analysis and, ultimately, misunderstand or incorrectly analyze the diseases. Farmers who are not attentive may overlook the signs and symptoms of serious infectious diseases, which could spread to other animals and cause sickness or even death [2][3].
The Multi-Layer Perceptron Neural Network (MLP-NN) is one of the most popular techniques used to classify complex data. This research aims to develop an analysis model for water buffalo diseases by applying feature selection techniques and the MLP-NN. Two feature selection techniques were employed, Correlation-based Feature Selection (CFS) and Information Gain (IG). After the features were selected, the data were used to develop the analysis model with the MLP-NN. The developed model can be applied to the development of a water buffalo disease analysis system, which is expected to help farmers analyze the diseases in a timely manner.

A. Data Imbalance Resolution
This research applies the Synthetic Minority Over-sampling Technique (SMOTE) to resolve the data imbalance problem. SMOTE synthesizes new records for the classes with a small amount of data [4] so that their size becomes comparable to that of the largest class. This is done by randomly selecting a record from a minority class, calculating the distance between the selected record and the other records of that class to find its nearest neighbors, and generating a new record between the selected record and one of those neighbors [5].
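The interpolation idea behind SMOTE can be illustrated with a minimal sketch. It is not the Weka filter used in this research: the function name `smote`, its parameters, and the toy records are hypothetical, and the base record is drawn at random rather than by iterating over every minority record.

```python
import random

def smote(minority, n_synthetic, k=5, seed=1):
    """Create n_synthetic records by interpolating between minority records
    and their k nearest minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority records by squared distance to the base record.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new record lies on the segment between base and neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-attribute minority class, for illustration only.
minority_class = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_points = smote(minority_class, n_synthetic=4, k=2)
```

Because each synthetic record is a convex combination of two existing records, it always stays within the range of values already observed for the minority class.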

B. Feature Selection Techniques
The goal of feature selection is to select the most significant features of each dataset so that the model can be built rapidly and data classification becomes more effective. In this study, two feature selection methods are employed.

1) Correlation-based Feature Selection (CFS)
CFS is a feature selection method that evaluates subsets of features by their capacity to predict the class while taking the relationships among the features themselves into account. CFS ranks the feature subsets according to how strongly their features are related to the class. Features that are irrelevant or only weakly related to the class are excluded, as are redundant features that are strongly related to other features. The formula for evaluating a feature subset in CFS is shown in (1) [6]:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \tag{1}$$

where $k$ refers to the number of features (data dimensions) in the subset, $M_S$ refers to the merit of the feature subset $S$ composed of $k$ features, $\overline{r_{cf}}$ refers to the average relationship between the features and the class ($f \in S$), and $\overline{r_{ff}}$ refers to the average relationship among the features.
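As a small worked illustration, the sketch below evaluates the CFS merit under the assumption that it takes the standard form M_S = k·r_cf / sqrt(k + k(k−1)·r_ff). The function name and the correlation values are hypothetical and are not taken from the dataset of this research.

```python
import math

def cfs_merit(k, mean_r_cf, mean_r_ff):
    """Merit of a subset of k features, given the mean feature-class
    correlation (mean_r_cf) and the mean feature-feature correlation (mean_r_ff)."""
    return (k * mean_r_cf) / math.sqrt(k + k * (k - 1) * mean_r_ff)

# Two hypothetical subsets that are equally relevant to the class
# but differ in how redundant their features are.
low_redundancy = cfs_merit(k=5, mean_r_cf=0.6, mean_r_ff=0.1)
high_redundancy = cfs_merit(k=5, mean_r_cf=0.6, mean_r_ff=0.8)
```

With the same relevance to the class, the subset whose features are less correlated with one another receives the higher merit, which is why CFS tends to discard redundant features.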

2) Information Gain (IG)
IG is a feature selection method in which the gain value of each node is evaluated. The node with the highest gain value is chosen as the root node, and the rest of the data are reassessed in order to find the gain value of the next node. The formula for finding the IG value is [7]:

$$Gain(Y; X) = H(Y) - H(Y \mid X) \tag{2}$$

where $Y$ refers to the class attribute, with values in the set $\{Y_1, Y_2, \ldots, Y_n\}$, $X$ refers to a feature other than the class, with values in $\{X_1, X_2, \ldots, X_n\}$, $Gain(Y; X)$ refers to the gain score, which ranges between 0 and $H(Y)$, $H(Y)$ refers to the entropy of $Y$, and $H(Y \mid X)$ refers to the conditional entropy of $Y$ given $X$.

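$H(Y)$ and $H(Y \mid X)$ are calculated in (3) and (4), respectively:

$$H(Y) = -\sum_{i=1}^{k} P(Y = y_i)\,\log_2 P(Y = y_i) \tag{3}$$

$$H(Y \mid X) = -\sum_{j=1}^{k} P(X = x_j) \sum_{i=1}^{k} P(Y = y_i \mid X = x_j)\,\log_2 P(Y = y_i \mid X = x_j) \tag{4}$$

where $P(Y = y_i)$ refers to the probability of each value from $y_1$ to $y_k$, $P(X = x_j)$ refers to the probability of each value from $x_1$ to $x_k$, and $k$ refers to the number of distinct values of the respective variable.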
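The information gain of a single feature can be sketched as follows, assuming the usual definition Gain(Y; X) = H(Y) − H(Y | X) with base-2 entropy. The function names and the toy symptom and disease values are hypothetical and only illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H(Y) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(Y; X) = H(Y) - H(Y | X) for one feature X and the class Y."""
    total = len(labels)
    conditional = 0.0
    for value, count in Counter(feature_values).items():
        # Class labels of the records where the feature takes this value.
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (count / total) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical toy records: one symptom attribute and the disease class.
fever = ["yes", "yes", "no", "no"]
disease = ["anthrax", "anthrax", "mastitis", "mastitis"]
gain = information_gain(fever, disease)
```

In this toy case the symptom separates the two classes perfectly, so the gain equals the full class entropy; a feature that takes the same value in every record would have a gain of zero.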

C. Multi-Layer Perceptron Neural Network
The MLP-NN, illustrated in Figure 1, consists of an input layer, hidden layers, and an output layer [8]. Each layer contains a collection of nodes, and there may be more than one hidden layer [9]. The MLP-NN operates by feeding data into the input layer, computing an estimate, and delivering the results to the output layer. The estimation requires the sum of the input data multiplied by the weight values, as shown in (5). The result is then passed through the sigmoid function, as shown in (6).

$$n = \sum_{i} P_i W_i \tag{5}$$

$$f(x) = \frac{1}{1 + e^{-x}} \tag{6}$$

where $n$ refers to the sum of the inputs $P_i$ multiplied by the weights $W_i$, and $i$ indexes the inputs and their weights. In (6), $x$ refers to the input value.
The output of the hidden layer is delivered to the output layer, where the estimated output is compared with the target output. If the difference between them is not acceptable, the error goes through the backpropagation process, from the output layer back through the hidden layers toward the input layer. During this process, the weights are adjusted to find the most acceptable values after testing with the data. Subsequently, the output is estimated with the sigmoid function once again [8].
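The weighted sum, the sigmoid activation, and the weight adjustment can be sketched for a single neuron as follows. This is a simplified illustration rather than the full multi-layer network: the function names, inputs, and initial weights are hypothetical, a squared-error loss is assumed, and momentum is omitted.

```python
import math

def sigmoid(x):
    """Sigmoid activation, f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights):
    n = sum(p * w for p, w in zip(inputs, weights))  # weighted sum of the inputs
    return sigmoid(n)

def update(inputs, weights, target, learning_rate=0.3):
    """One gradient-descent step on the squared error 0.5 * (target - output)^2."""
    output = forward(inputs, weights)
    # Negative gradient of the error with respect to the weighted sum.
    delta = (target - output) * output * (1.0 - output)
    return [w + learning_rate * delta * p for w, p in zip(weights, inputs)]

# Hypothetical inputs and initial weights, for illustration only.
inputs = [1.0, 0.0, 1.0]
weights = [0.1, -0.2, 0.05]
before = forward(inputs, weights)
for _ in range(500):
    weights = update(inputs, weights, target=1.0)
after = forward(inputs, weights)
```

Each pass compares the estimated output with the target and moves the weights in the direction that reduces the error, so the output after training lies closer to the target than the output before training.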

D. Similar Studies
Authors in [13] compared the effectiveness of data imbalance resolution techniques by using diabetes patients' data. The research team compared four different methods: oversampling, undersampling, a hybrid method, and SMOTE. Two data classification techniques, Multinomial Logistic Regression Analysis and Decision Tree, were applied to classify diabetes patients. The research findings showed that the combination of data adjusted by SMOTE and classification with the decision tree technique provided the best results for the classification of diabetes patients. Authors in [14] studied the classification of heart diseases using MLP-NN and IG as a feature selection method. The findings showed that the number of features could be reduced from 13 to 8, while the accuracy on the training dataset increased by 1.1% and the accuracy on the test dataset increased by 0.82%. Authors in [15] studied the classification of ovarian cancer by applying SMOTE and MLP-NN. The findings indicated that the SMOTE technique could adjust the data balance, and that the model's effectiveness increased when the adjusted data were used to construct the model with MLP-NN. The experiment results showed that SMOTE+MLP provided a data classification accuracy of 96%, which was higher than that of SMOTE+RBF. Authors in [16] applied feature selection along with MLP to predict chronic diseases. The research findings showed that applying these two methods together predicted chronic diseases more effectively than the Support Vector Machine (SVM) and the Decision Tree. In the current research, two feature selection techniques, CFS and IG, were employed, and the developed model can be applied to the development of a water buffalo disease analysis system.

E. Effectiveness Evaluation
Effectiveness is evaluated with the confusion matrix, which measures the accuracy of a classifier, that is, how well its predictions agree with the true classes. The accuracy can be calculated by (7) [10], while precision and recall can be calculated by (8) and (9) [11][12], with the values represented in Figure 2.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{7}$$

$$Precision = \frac{TP}{TP + FP} \times 100 \tag{8}$$

$$Recall = \frac{TP}{TP + FN} \times 100 \tag{9}$$

where TP refers to when the target class is "Yes," and the model predicts it as "Yes" (True Positive), FP refers to when the target class is "No," but the model predicts it as "Yes" (False Positive), TN refers to when the target class is "No," and the model predicts it as "No" (True Negative), and FN refers to when the target class is "Yes," but the model predicts it as "No" (False Negative).

III. RESEARCH METHODOLOGY
The research methodology for the analysis of Thai water buffalo diseases consists of 1) data collection and preparation, 2) data imbalance adjustment using SMOTE, 3) feature selection by CFS and IG, 4) model development using MLP-NN, and 5) model effectiveness evaluation (Figure 3).

A. Data Collection and Preparation
This research collected data from books and documents related to water buffalo diseases [3] and from the Department of Livestock Development's official website. There are 480 records in total. The data cover six water buffalo disease classes, namely anthrax, hemorrhagic septicemia, brucellosis, foot and mouth disease, parasitic diseases, and mastitis. These data were used for developing the disease analysis model. There are 33 attributes and six classes, as illustrated in Table I. After collecting the data illustrated in Table I, the research team rechecked their reliability and accuracy to ensure that no attribute was incorrect or missing, for example, that no value was out of range for its attribute. Then, the data were converted into .CSV file format for use with Weka version 3.9, as shown in Figure 4.

B. Data Imbalance Adjustment using SMOTE
Since the prepared data were found to be imbalanced across the classes, the research team adjusted the imbalance by increasing the number of records in the classes with few records. The k-nearest neighbor value was varied from 1 to 5, and it was experimentally found that k = 5 and randomSeed = 1 gave the best result. The data size was then increased from 100% until the highest level of effectiveness could be gained (as evaluated by 10-fold cross-validation). The experiment results showed that the data could be balanced and upsized up to 300%. Thus, the new dataset increased to 528, 688, and 768 records for data balancing using SMOTE at 100%, 200%, and 300% of its original size, respectively.

C. Feature Selection by CFS and IG
The data with 33 attributes and 6 classes were brought into the feature selection process using CFS and IG in Weka. Eight groups of data were processed in this step, namely 1) the original dataset through CFS, 2) the original dataset through IG, 3) 100% of SMOTE through CFS, 4) 200% of SMOTE through CFS, 5) 300% of SMOTE through CFS, 6) 100% of SMOTE through IG, 7) 200% of SMOTE through IG, and 8) 300% of SMOTE through IG. The resulting datasets are used in the next step.

D. Model Development Using MLP-NN
After the data imbalance had been adjusted, the data were passed to the learning process to construct the model, in which the research team applied the two feature selection techniques, CFS and IG, along with the MLP-NN. In this research, the input layer consisted of 33 neurons and the output layer consisted of 6 neurons. The optimum parameters for the MLP-NN model set in Weka were: Hidden Layer = 4, Training Time = 500, Learning Rate = 0.3, Momentum = 0.2, and epochs = 500. These values provided the highest level of effectiveness when evaluated by 10-fold cross-validation. In addition, a model was generated from the original dataset, which had not been processed by SMOTE, CFS, or IG in any way, for effectiveness comparison with the models that had undergone balancing and/or feature selection. Thus, 9 models were built from the original dataset and the 8 feature selection datasets.
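The 10-fold cross-validation procedure used for evaluation can be sketched as follows. The research ran it inside Weka; this sketch is not Weka's implementation, and the function names, the stratified round-robin assignment, and the toy class labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=1):
    """Split record indices into n_folds folds with classes spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled records of each class to the folds in turn.
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds

def train_test_splits(folds):
    """Yield (train, test) index lists, holding out each fold exactly once."""
    for held_out, test_indices in enumerate(folds):
        train_indices = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_indices, test_indices

# Hypothetical class labels, for illustration only.
labels = ["anthrax"] * 40 + ["mastitis"] * 60
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Each record is used for testing exactly once and for training nine times, and the reported accuracy, precision, and recall summarize the results over the ten held-out folds.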

IV. RESULTS
According to the experiment results, the most appropriate data size adjusted by SMOTE was 200%. Then, the features of the complete dataset were selected by CFS and IG. Afterwards, the MLP-NN was developed and its effectiveness was evaluated by 10-fold cross-validation, as illustrated in Table II and Figure 5. According to Table II and Figure 5, under 10-fold cross-validation the SMOTE (200%)+CFS+MLP dataset provided an accuracy of 99.71%, a precision of 99.70%, and a recall of 99.72%, which were the highest values. The comparison between the outputs of CFS and IG given by the MLP-NN showed that, after the data were adjusted by SMOTE, the CFS method provided better feature selection than the IG method.
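The way accuracy, precision, and recall are derived from confusion-matrix counts can be sketched for a single class treated as "Yes" against all other classes. The function name and the toy predictions below are hypothetical and are not the results of this research. How the per-class values are averaged over the six disease classes is not shown here.

```python
def per_class_metrics(actual, predicted, positive):
    """Accuracy, precision, and recall (in percent) for one class treated as "Yes"."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # Yes predicted as Yes
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # No predicted as Yes
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # Yes predicted as No
    tn = len(pairs) - tp - fp - fn                                    # No predicted as No
    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical target classes and predictions, for illustration only.
actual = ["anthrax", "anthrax", "mastitis", "mastitis", "mastitis"]
predicted = ["anthrax", "mastitis", "mastitis", "mastitis", "anthrax"]
accuracy, precision, recall = per_class_metrics(actual, predicted, "anthrax")
```

In this toy case one of the two anthrax records is recovered and one of the two anthrax predictions is correct, so precision and recall are both 50% while accuracy is 60%.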
V. CONCLUSION
This research aimed to analyze water buffalo diseases by applying feature selection techniques along with the MLP-NN. The data imbalance was adjusted by the SMOTE method. Two feature selection methods were employed: CFS and IG. After that, the data were classified by the MLP-NN, and the model's effectiveness was evaluated by 10-fold cross-validation. The research findings showed that the most suitable data size after the data imbalance adjustment was 200%. After using the obtained data to construct the model, it was found that the model whose data size was adjusted by SMOTE and which was developed with CFS and MLP-NN provided the highest level of effectiveness in data classification, with an accuracy of 99.71%. Therefore, the developed model can be applied to the development of an analysis system for water buffalo diseases. The results of this study conform to the research conducted in [15], in which the SMOTE+MLP method was applied for data classification and a high level of effectiveness was reached with an accuracy of 90%, and are also in accordance with [16], in which feature selection techniques were applied alongside the MLP for data classification and gained a higher level of effectiveness.