Improving Neural Network Performance with Feature Selection Using Pearson Correlation Method for Diabetes Disease Detection



I. INTRODUCTION
Diabetes is known as a silent killer, a disease characterized by a high level of sugar (glucose) in the blood. The pancreas produces insulin to regulate blood sugar. An abnormal blood sugar level can damage internal organs and nerves in the human body. This has made diabetes a worldwide concern and places it in the category of serious diseases [1].
According to WHO data, the number of diabetic patients increased from 108 million in 1980 to 422 million in 2014 [2]. The number of diabetic patients in Indonesia also increased by up to 2% from 2013 to 2018 [3]. The steadily growing number of patients places an extra burden on medical personnel [4]. On the economic side, the disease imposes an extra burden of $13,700 or more [1]. That is why diabetes must be detected accurately [5]. With an accurate detection method, medical workers can receive a preliminary analysis that helps them anticipate the disease. In addition, early analysis can reduce the cost of care [4].
Diabetes has been the subject of previous research, including the work of Verikas and Bacauskiene. The researchers explained the importance of predicting diabetes with a Neural Network approach. However, this approach suffered from incorrect attribute selection. To solve this problem, the researchers used the Neural Network Feature Selector method to increase the prediction accuracy.
Behloul et al. also studied diabetes. They examined the performance of the Neural Network, which tends to be slow on high-dimensional diabetes data, and solved the problem by applying a Fuzzy algorithm to choose the important attributes. As a result, the accuracy increased from 75.78% to 79.42%. Nilashi et al. used a Classification Neural Network method to find important factors through preprocessing in order to increase efficiency and accuracy [6]. Another study addressed the dependency problem in diabetes data with a Feed Forward Neural Network (FNN) and a Recurrent Neural Network (RNN). It reported RMSE values of 0.652 for the FNN and 0.624 for the RNN, with AUC values of 84.94% and 86.67%, respectively [7].
The Neural Network algorithm can generalize and identify non-linear relations in the data. That is why it is used in many applications and studies [8]. However, alongside these advantages, the Neural Network suffers from a slow training process [9].
In addition, Neural Networks often perform slowly because of interference in the attribute identification process [10]. Proper feature selection is needed to determine which attributes are important and relevant. This is one of the important aspects that increase Neural Network performance [11].
Pearson Correlation is an algorithm that can measure the information shared between attributes and their labels. It is also capable of handling mixed-type attributes efficiently. For these reasons, it is suitable as a feature selection method for the Neural Network algorithm [12].
Data preprocessing is an important technique for increasing the quality of the data. Higher data quality in turn increases accuracy and efficiency. Algorithm performance also depends on data quality, so better data means less interference. The most common problems in Neural Networks are caused by poor data quality.
That is why this research proposes a Neural Network approach to detect diabetes, with Pearson Correlation used to increase the quality of the data. The aim is to understand how the Pearson Correlation method can increase the performance of the Neural Network algorithm.

II. METHOD
This research consists of several steps: data gathering, feature selection using Pearson Correlation, classification using a Backpropagation Neural Network, and accuracy evaluation. The steps are described as follows:

A. Data Gathering
The data used in this research is a collection of records of patients with diabetes. It is categorized as secondary data since it is obtained from an existing source [13]. The secondary data is taken from the UCI Machine Learning Repository Diabetes Hospital 130-US dataset, which has 101767 rows [14]. The dataset has 50 attributes, divided into 49 predictor attributes and 1 label. The 50 attributes are presented in Table I. The UCI Diabetes Hospital 130-US dataset contains 10 years' worth of data collected between 1999 and 2008 from 130 hospitals in the US. Of the 101767 records, only 100 randomly picked records were used in this research to ease the processing stage. All 50 features are retained at this point and are selected with the Pearson Correlation algorithm in the next step.

B. Implementation Methods
After data gathering, the next step is to select features from the randomly picked records with the Pearson Correlation method. In statistics, Pearson Correlation is a method for measuring the relation between two variables.
It is based on the data covariance matrix and evaluates the strength of the relationship between two vectors [12]. The Pearson Correlation is calculated using (1) as follows [15]:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \tag{1} $$

Where: $r$ is the correlation coefficient, $n$ is the number of data pairs, $\sum xy$ is the sum of the products of variables $x$ and $y$, $\sum x$ and $\sum y$ are the sums of variables $x$ and $y$, and $\sum x^{2}$ and $\sum y^{2}$ are the sums of the squares of variables $x$ and $y$.
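As a minimal sketch of the calculation in (1), the raw-sum form can be written in a few lines of Python. The function name `pearson_r`, the toy values, and the choice to return 0 for a constant column are assumptions made for illustration; they are not taken from the dataset or from the original implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient using the raw-sum form of Eq. (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # Assumption: a constant column carries no information, so report 0.
        return 0.0
    return numerator / denominator


# Hypothetical toy values, not taken from the dataset.
feature = [1, 2, 3, 4, 5]
label = [0, 0, 1, 1, 1]
r = pearson_r(feature, label)
print(round(r, 3))
```

In a feature selection setting, the same function would be applied once per predictor attribute against the label column.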
After the correlation coefficient of each feature is obtained, the next step is to sort the coefficients from the highest to the lowest value and interpret them according to the Guilford criteria in Table II.
In this feature selection process, the dominant features are the ones with the highest weights calculated by the Pearson Correlation algorithm. After the dominant features are obtained, the next step is classification using the Backpropagation Neural Network algorithm.
The Backpropagation Neural Network algorithm uses a binary sigmoid activation function, so the desired output lies between 0 and 1. The main problem of the Backpropagation Neural Network is that the number of iterations required is not known in advance, so training can take a relatively long time. This uncertainty affects the number of epochs needed to reach the desired condition, and researchers differ on how the parameter values should be chosen. In this study, the stages of the Backpropagation Neural Network algorithm are as follows.
The first step is to initialize the weights with random values and to set the maximum epoch, the target error, and the learning rate. In this research, the epoch counter is initialized to 0 and the MSE to 1.
The second step is to repeat the process, incrementing the epoch by one, as long as two conditions hold: 1) the epoch is less than the maximum epoch (epoch < max epoch), and 2) the MSE is greater than the target error (MSE > target error). Each training pair passes through the feed-forward, backpropagation, and MSE stages.
The third step is the feed-forward stage, in which every input unit ($x_i$) receives its input signal and forwards it to all units in the next layer (the hidden layer).
The fourth step is for every hidden unit ($z_j$) to sum its weighted input signals using $z\_in_j = \sum_i x_i v_{ij}$. The activation function is then applied to calculate the output signal ($z_j = f(z\_in_j)$), which is sent to all units in the next layer (the output layer).
The fifth step is for every output unit ($y_k$) to sum its weighted input signals using $y\_in_k = \sum_j z_j w_{jk}$. The activation function is then applied to calculate the output signal ($y_k = f(y\_in_k)$).
The sixth step is the backpropagation stage, in which every output unit ($y_k$) receives the target pattern that corresponds to the input training pattern and calculates the error information with $\delta_k = (t_k - y_k) f'(y\_in_k)$. If the calculated MSE <= the input MSE (the target error), the process stops. Otherwise, the process continues reading from the first data record (i = 1) until the condition calculated MSE <= input MSE is fulfilled. In the evaluation stage, the hidden and output layers are recalculated. The schematic of the implementation stage of the Backpropagation Neural Network is shown in Fig. 1.
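The stages above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this research: it assumes a single hidden layer and a single output unit, adds bias weights, and uses the standard textbook rules for the hidden-layer error term and the weight updates, since those details are not spelled out in the text. The function names, hyperparameter values, and the toy training pairs are all hypothetical.

```python
import math
import random


def sigmoid(x):
    """Binary sigmoid activation; the output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(v, w, x):
    """Feed forward: return the hidden signals z and the single output y."""
    z = [sigmoid(sum(vi * xi for vi, xi in zip(vj[:-1], x)) + vj[-1]) for vj in v]
    y = sigmoid(sum(wj * zj for wj, zj in zip(w[:-1], z)) + w[-1])
    return z, y


def train(samples, n_hidden=3, learning_rate=0.5, max_epoch=1000,
          target_error=0.01, seed=0):
    """Train on (inputs, target) pairs; return weights, epochs used, final MSE."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 1: random initial weights. The last entry of each row is a bias
    # weight, which is an assumption not stated in the text.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    epoch, mse = 0, 1.0  # initial values as given in the text
    # Step 2: loop while epoch < max epoch and MSE > target error.
    while epoch < max_epoch and mse > target_error:
        epoch += 1
        squared_error = 0.0
        for x, t in samples:
            z, y = forward(v, w, x)  # steps 3 to 5
            # Step 6: output error information; f'(y_in) = y * (1 - y) for sigmoid.
            delta_k = (t - y) * y * (1.0 - y)
            # Hidden-layer error term (standard rule, assumed here).
            delta_j = [delta_k * w[j] * z[j] * (1.0 - z[j]) for j in range(n_hidden)]
            # Weight updates after each pattern (online learning, assumed).
            for j in range(n_hidden):
                w[j] += learning_rate * delta_k * z[j]
                for i in range(n_in):
                    v[j][i] += learning_rate * delta_j[j] * x[i]
                v[j][-1] += learning_rate * delta_j[j]
            w[-1] += learning_rate * delta_k
            squared_error += (t - y) ** 2
        # MSE over the epoch, accumulated while the weights were being updated.
        mse = squared_error / len(samples)
    return v, w, epoch, mse


# Hypothetical toy training pairs (logical OR), not taken from the dataset.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
v, w, epoch, mse = train(samples, max_epoch=200)
print(epoch, mse)
```

The loop stops as soon as either stopping criterion is met, so the returned epoch count shows whether training ended by reaching the target error or by exhausting the maximum epoch.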

III. RESULTS AND DISCUSSION

A. Research Data
The data set consists of 101767 rows and 50 attributes. Of the 101767 rows, this research used only 100 rows to avoid a long processing time (Table III). These 100 rows are picked randomly and then divided randomly into two data sets with a 70% and 30% ratio. The 70% portion is used as training data, and the rest is used as test data.
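The sampling and splitting procedure can be sketched as follows. This is a hedged illustration only: the function name `sample_and_split`, the fixed seed, and the stand-in row identifiers are assumptions, and the original research does not state which tool performed the split.

```python
import random


def sample_and_split(rows, sample_size=100, train_ratio=0.7, seed=42):
    """Randomly pick sample_size rows, then split them into train and test sets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(rows, sample_size)  # selection order is already random
    cut = int(round(sample_size * train_ratio))
    return sample[:cut], sample[cut:]


# Stand-in row identifiers; the real rows would come from the dataset file.
rows = list(range(101767))
train_rows, test_rows = sample_and_split(rows)
print(len(train_rows), len(test_rows))
```

Because the sample is drawn without replacement, no row can appear in both the training and the test portion.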
The data in this dataset is then transformed into numeric form. The transformation maps each categorical value to a number, e.g. Female as 0 and Male as 1 (Table IV).
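A minimal sketch of this transformation is shown below. Only the Female/Male mapping comes from the text; the function name `encode_column` and the sample values are hypothetical.

```python
def encode_column(values, mapping):
    """Replace each categorical value with its numeric code."""
    encoded = []
    for value in values:
        if value not in mapping:
            # Fail loudly instead of silently inventing a code.
            raise ValueError(f"unmapped category: {value!r}")
        encoded.append(mapping[value])
    return encoded


gender_mapping = {"Female": 0, "Male": 1}  # mapping given in the text
gender_column = ["Female", "Male", "Male", "Female"]  # hypothetical sample values
encoded_gender = encode_column(gender_column, gender_mapping)
print(encoded_gender)
```

Raising an error on an unmapped value is a design choice of this sketch; it makes missing or unexpected categories visible before they reach the correlation step.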

B. Feature Selection
The weights calculated with the Pearson Correlation method are shown in Table V. After the correlation value of each attribute is obtained, the next step is to sort the correlation values from the largest to the smallest and then determine a threshold on the level of importance (weight) of the attributes. Attributes whose weight is equal to or greater than the threshold are retained, while attributes whose weight is below the threshold are ignored and are not used in subsequent calculations.
The threshold is determined by running five tests, the results of which are shown in Table VI. In these experiments, the highest accuracy, 96.00%, is reached with 6 attributes. The threshold is therefore set to the weight of the sixth-ranked attribute, which is 0.301. The next step is to reduce the attributes according to the Pearson Correlation results: any attribute with a weight lower than the threshold of 0.301 is removed from the data. The final data consists of only 6 attributes, as shown in Table VII.
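The thresholding step can be sketched as follows. The attribute names and most of the weights are hypothetical placeholders, since the actual values are those in Table V; only the threshold of 0.301 is taken from the text. The helper names `threshold_for_top_k` and `select_features` are also assumptions.

```python
def threshold_for_top_k(weights, k):
    """Return the weight of the k-th ranked attribute (ranked high to low)."""
    ranked = sorted(weights.values(), reverse=True)
    return ranked[k - 1]


def select_features(weights, threshold):
    """Keep attributes whose weight is at least the threshold, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in ranked if weight >= threshold]


# Hypothetical attribute names and weights, for illustration only.
weights = {"attr_a": 0.512, "attr_b": 0.301, "attr_c": 0.120, "attr_d": 0.420}
threshold = threshold_for_top_k(weights, k=3)
selected = select_features(weights, threshold)
print(threshold, selected)
```

In the procedure described above, `k` would be varied across the five tests and the value giving the highest classification accuracy would fix the threshold.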

C. Evaluation and Validation
To measure the model performance, this research calculates the level of prediction accuracy using a validation model. Understanding the model performance helps to optimize the parameters and choose the optimal algorithm. Cross-validation, one of many validation techniques, is used in this research. This technique repeatedly validates the model on many training and test subsets of the data. In every iteration, one subset is used for testing and the remaining data is used for training.

IV. CONCLUSION
This research shows that the accuracy and AUC values obtained with the Neural Network method based on Pearson Correlation are higher than those obtained with the Neural Network method alone. The accuracy increases by 1.07 percentage points, from 94.93% for the Neural Network method alone to 96.00% when Pearson Correlation is used for feature selection. It can therefore be concluded that the Pearson Correlation algorithm succeeded in improving the performance of the Backpropagation Neural Network algorithm by selecting important features in the UCI Diabetes Hospital 130-US data. The feature selection carried out by the Pearson Correlation algorithm is purely computational and does not take into account which features are considered important in medical practice. This research can therefore be developed further, for example by choosing features that suit medical needs in collaboration with medical experts. Overall, however, this research fulfills its aim, which is limited to measuring the performance of the algorithm used.