Decision Support System for Classification of Early Childhood Diseases Using Principal Component Analysis and K-Nearest Neighbors Classifier

Background: Data on early childhood disease collected in clinics has accumulated into big data. These data can be used to classify early childhood diseases and so help medical staff diagnose diseases that affect young children. Objective: This study aims to apply Principal Component Analysis (PCA) and the K-Nearest Neighbor (K-NN) classifier to the classification of early childhood diseases. Methods: PCA was used to obtain the variables that had a major influence on the classification of early childhood diseases. PCA was done by observing the correlation between variables and eliminating variables that had little influence on classification. The early childhood disease data was then classified using the K-NN classifier. Results: System evaluation using 150 test data indicated that the classification system applying PCA and the K-NN classifier had an accuracy of 86%. Conclusion: PCA can be used to reduce the number of variables involved, which improves system performance in terms of efficiency. In addition, applying PCA together with K-NN can also improve accuracy in the classification of early childhood diseases.


I. INTRODUCTION
Young children are assets of the nation who need good education and care in order to improve the quality of the nation [1]. The early childhood mortality figure is about 12.4 million each year [2]. Improving children's health is one of the government's most important programs.
During the growth period, children are susceptible to various diseases. These diseases are mostly caused by germs or viruses that come into direct contact with children. The symptoms that often accompany early childhood diseases are fever, cough, and diarrhea [3] [4] [5] [6]. Proper handling is needed so that children recover without any side effects.
Children's health clinics provide services and treatment for young children. Every patient who comes is recorded in a medical record. The medical record data accumulates in clinical storage and becomes big data, which makes it difficult for medical staff to process the stack of patient data. The medical staff diagnose diseases and give treatment based on Manajemen Terpadu Balita Sakit (MTBS).
MTBS is a comprehensive program that deals with sick children who come for basic health care services [7]. MTBS is the Indonesian embodiment of the international WHO and UNICEF program to integrate services for early childhood diseases, known internationally as Integrated Management of Childhood Illness (IMCI). MTBS classifies early childhood diseases into several classes based on complaints of cough, diarrhea, and fever. Each class has a list of symptoms suffered by children.
Symptoms suffered by children can be processed so that medical staff can determine the type of disease and the type of treatment [8] [9]. In addition, other factors also affect the disease that children suffer. Medical record data generally includes body weight, height, age, body temperature, and so on. The data collected in medical records can be used to extract information [8], so that the types of diseases suffered by patients can be identified [10].
Many studies have been conducted on the diagnosis of early childhood disease. One of them, [9], states that forward chaining can be used to diagnose early childhood disease. However, that study uses only 14 symptom factors and ignores other factors such as height, weight, age, sex, and body temperature. These factors can also be used to diagnose disease [11].
A patient's medical record consists of various features. Feature reduction is very important for identifying the most significant risk factors associated with a disease [12]. Feature reduction is an efficient data pre-processing technique in data mining for reducing data dimensions [13] [14] [15].
PCA is a well-known statistical technique that aims to reduce the dimensions of data without losing important information [16] [17]. PCA converts and decomposes a large number of correlated variables into a small number of uncorrelated variables, thereby reducing the dimensions of the data [18]. PCA has several advantages, such as reducing data redundancy, complexity, database size, and noise. PCA can also be used to determine correlations between variables [18].
Classification is a technique used to find the class of data whose class is unknown [19]. Various classification methods are known, such as decision trees, rule-based methods, the K-NN classifier, and others [20]. The K-NN classifier classifies new objects based on their nearest neighbors. K-NN became well known among data mining techniques because of its simplicity and relatively high speed of convergence. K-NN is also called lazy learning because it does not go through a training phase, and memory-based classification because the training samples must be in memory while the process is running [11].
Jabbar [11] implemented K-NN with feature subset selection to determine the variables that contribute most to disease prediction. This method can indirectly reduce the number of tests that a patient must take. The resulting prediction model can help doctors make efficient decisions with fewer variables when diagnosing heart disease.
In those studies, the accuracy of the diagnosis depends on the features used. Therefore, it is very important to develop a systematic scheme that can determine the most representative features in order to maximize the accuracy of diagnoses of early childhood diseases. In this paper we investigate the application of K-NN with feature reduction to classify early childhood diseases.

A. Data collection
The data collection methods used in this study were direct interviews with sources and literature studies. They aimed to gain knowledge about the assessment and classification of early childhood diseases. Interviews were conducted with three doctors and two nurses. The patients were all children diagnosed with a disease at the children's health clinic where the data was collected. The data was obtained from patient records at one of the children's health clinics in Surabaya over 3 years, from 2015 to 2017.

B. Data analysis
The data obtained was divided into two parts, i.e. training data (70%) and testing data (30%). Based on the MTBS reference, early childhood diseases were classified into 16 diagnoses and 26 symptoms. Data analysis consisted of two analyses, i.e. Principal Component Analysis (PCA) and the K-Nearest Neighbor (K-NN) classifier. PCA was used to determine the correlation between data variables, and the K-NN classifier was used to determine the classification of early childhood diseases.
1) Principal Component Analysis (PCA)
PCA was used to discover the correlation between data variables. The data variables were age, weight, height, body temperature, sex, and 26 symptoms. The symptoms were cough, diarrhea, fever, inability to drink or suckle, vomiting, unconsciousness, fast breathing, breathing difficulty, stridor, liquid or soft defecation, hollowed eyes, poor abdominal skin turgor, fussiness/irritability, abnormal thirst, nausea, diarrhea of 14 days or more, blood in feces, stiff neck, rash, red eyes, turbidity on the cornea, mouth ulcer, festering eyes, fever for 2 to 7 days, and high and continuous sudden fever. The 16 diagnoses were cough, pneumonia, severe pneumonia, diarrhea, mild dehydration diarrhea, severe dehydration diarrhea, persistent diarrhea, severe diarrhea, dysentery, common fever, severe fever, measles, measles with severe complication, measles with complication, fever may be Dengue Hemorrhagic Fever (DHF), DHF, and fever isn't DHF.
By examining the correlation between these variables, the factors that influence early childhood diseases were obtained. The PCA steps were as follows:
1. Calculation of the variance. The variance was calculated using (1).
2. Calculation of the covariance. The covariance was calculated using (2). After that, the covariance matrix was generated.
3. Calculation of the eigenvalues and eigenvectors. The eigenvalues were calculated using (3). After the eigenvalues of the covariance matrix were known, the eigenvector of each eigenvalue was calculated using (4).
4. Calculation of the principal components.
5. Calculation of the correlation between the variables and the principal components using (5), once the principal components were known.
6. Reduction of the variables by eliminating the low components.
The K-NN classification steps were as follows:
1. Selection of the variables. The variables obtained from PCA were the variables that influence classification.
2. Normalization. The dataset was normalized using the min-max normalization in (6).
3. Calculation of the distance using the Euclidean distance in (7).
4. Generation of the class of the testing data. The weighted voting function in (8) was used to calculate the weight of each class, and the testing data was assigned to the class with the greatest value.
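The numbered equations cited in these steps are not reproduced in this text. As a reading aid only, the standard textbook forms that such steps normally rely on are sketched here. The notation ($x_{ik}$ for the value of variable $k$ in record $i$, $S$ for the covariance matrix, $\lambda_i$ and $e_i$ for its eigenvalues and eigenvectors) is an assumption and may differ from the original equations. The $n-1$ denominator is chosen to match the covariance example in this paper. The weighting function in (8) is left out because its exact form cannot be recovered from the text.

```latex
\[
\begin{aligned}
\text{variance:}\quad & s_{kk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ik}-\bar{x}_k\right)^2 \\
\text{covariance:}\quad & s_{jk} = \operatorname{cov}(X_j, X_k) = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right) \\
\text{eigenvalues:}\quad & \det\left(S-\lambda I\right) = 0 \\
\text{eigenvectors:}\quad & \left(S-\lambda_i I\right)e_i = 0 \\
\text{principal component:}\quad & Y_i = e_i^{\top}X \\
\text{correlation:}\quad & r_{X_k,Y_i} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{s_{kk}}} \\
\text{min-max normalization:}\quad & x' = \frac{x-x_{\min}}{x_{\max}-x_{\min}} \\
\text{Euclidean distance:}\quad & d(p,q) = \sqrt{\sum_{k=1}^{m}\left(p_k-q_k\right)^2}
\end{aligned}
\]
```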
C. System Design
The system design displayed the flow of the application of Principal Component Analysis and K-Nearest Neighbor for the classification of early childhood diseases in the form of flowcharts. Flowcharts were made for two systems, i.e. the early childhood disease classification system using PCA and the K-NN classifier, and the early childhood disease classification system using the K-NN classifier alone.

D. System Implementation
At this stage, the system was built based on the system design and implemented as a web-based system using the PHP programming language. A MySQL database was used as storage for the early childhood disease data.

E. System Testing
System testing was required to find out whether the system worked properly and correctly. System testing was done with black box testing techniques, using data obtained from the children's clinic.

F. System Evaluation
System evaluation compared the output of the system applying PCA and the K-NN classifier with that of the K-NN classifier without PCA. Each system's output was compared against the original data to obtain its accuracy. The two accuracies were then compared to determine which performed better: classification using PCA and the K-NN classifier, or classification using the K-NN classifier alone.
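The comparison described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' PHP implementation; the function name `accuracy_percent` and the four sample labels are hypothetical and do not come from the clinic data.

```python
def accuracy_percent(expected, predicted):
    """Percentage of predictions that match the expected (original) labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one label is required")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return 100.0 * correct / len(expected)


# Hypothetical labels, for illustration only (not taken from the paper).
expected = ["cough", "pneumonia", "diarrhea", "cough"]
with_pca = ["cough", "pneumonia", "diarrhea", "measles"]
without_pca = ["cough", "diarrhea", "diarrhea", "measles"]

acc_with_pca = accuracy_percent(expected, with_pca)
acc_without_pca = accuracy_percent(expected, without_pca)
better = "PCA + K-NN" if acc_with_pca >= acc_without_pca else "K-NN only"
```

Both systems are scored against the same expected labels, so the two percentages are directly comparable.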

A. Data and Information Collection
The data collection techniques used in this study were interviews and literature studies. The interviews yielded information about how to classify early childhood diseases based on the symptoms and complaints suffered by patients, and how to obtain patient data. The literature on PCA and K-NN classification was obtained from library books, e-books, and scientific journals. A literature study was also conducted to find out more about MTBS, drawing on the MTBS modules and related scientific journals.

B. Data Analysis
The dataset consisted of 500 records. It was divided into two parts, namely 350 training records (70%) and 150 testing records (30%). The training data was used with PCA to determine the variables that influence the classification process. After the influencing variables were obtained, the classification process was carried out. K-NN was used to classify the testing data.
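A 70/30 split of this kind can be sketched as follows. The paper does not say whether the records were shuffled or split in their original order, so the seeded shuffle here is an assumption, and `split_train_test` is a hypothetical helper name.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 500 patient records (integers used as placeholders).
records = list(range(500))
train, test = split_train_test(records)
```

With 500 records and a fraction of 0.7, the cut falls at 350, which reproduces the 350/150 division reported above.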
1) Principal Component Analysis (PCA)
PCA was used to determine the correlation between data variables. Only the training data was analyzed using PCA, as the system knowledge base. The steps in PCA were as follows:
1. Determination of the variables to be analyzed.

2. Creation of the covariance matrix.
The covariance matrix was created by calculating the covariance values between data variables. For each variable, the covariance with itself and with every other variable was calculated according to Eq. 2. For example, cov(X1, X1) was the covariance of variable X1 with itself, and cov(X1, X2) was the covariance of variable X1 against X2. The same formula was used for X2, X3, X4, ..., X31. In the example calculation, cov(X1, X2) was 42.334.
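The construction of a covariance matrix can be sketched in plain Python as follows. The sample values are hypothetical and are not the clinic data, and the `n - 1` denominator is assumed because it matches the "(500 − 1)" divisor that appears in the paper's own example.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equally long columns, using the (n - 1) denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def covariance_matrix(columns):
    """Entry [j][k] is cov(X_j, X_k); the diagonal holds the variances."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Hypothetical values for two variables (age in months, weight in kg).
age = [20.0, 24.0, 12.0, 36.0]
weight = [10.0, 11.0, 9.0, 14.0]
S = covariance_matrix([age, weight])
```

The matrix is symmetric, so cov(X1, X2) and cov(X2, X1) occupy mirrored positions and only need to be computed once in a larger implementation.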

3. Calculation of eigenvalues and eigenvectors.
After the covariance matrix was formed, the eigenvalues and eigenvectors were calculated. The eigenvalues and eigenvectors could be computed using the website www.comnuan.com. The eigenvalues and eigenvectors for X1, X2, X3, and X4 can be seen in Table 1.
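The study obtained these values from an online calculator. For readers who want to reproduce the step offline, the sketch below uses power iteration with deflation, which is a substitute technique and not the method used in the paper. It assumes a symmetric positive semi-definite matrix such as a covariance matrix, and the 2 × 2 input is hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, iterations=1000, tol=1e-12):
    """Largest eigenvalue and a unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    # Uneven start vector, so it is unlikely to be orthogonal to the target.
    vector = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in vector))
    vector = [x / norm for x in vector]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:  # the vector lies in the null space
            return 0.0, vector
        vector = [x / norm for x in product]
        # Rayleigh quotient v^T S v estimates the eigenvalue for a unit vector v.
        estimate = sum(a * b for a, b in zip(vector, mat_vec(matrix, vector)))
        converged = abs(estimate - eigenvalue) < tol
        eigenvalue = estimate
        if converged:
            break
    return eigenvalue, vector


def eigen_decomposition(matrix):
    """Eigenpairs of a symmetric positive semi-definite matrix, largest first."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(n):
        value, vector = dominant_eigenpair(work)
        pairs.append((value, vector))
        # Deflation: subtract the component just found before the next search.
        work = [[work[i][j] - value * vector[i] * vector[j] for j in range(n)]
                for i in range(n)]
    return pairs


S = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical covariance matrix
pairs = eigen_decomposition(S)
```

Power iteration converges slowly when two eigenvalues are close together, so a dedicated linear algebra library is the better choice for the full 31-variable matrix.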

4. Determination of the Principal Components.
The eigenvectors above were used to determine the Principal Components. From these calculations, four Principal Components were formed by multiplying the eigenvectors by the X variables.
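Forming a Principal Component amounts to a dot product between one eigenvector and the variable values of a record. The sketch below illustrates this with two hypothetical eigenvectors and one hypothetical record. The text does not mention centering the variables beforehand, so none is applied here.

```python
def principal_component_scores(record, eigenvectors):
    """Return one score per eigenvector: Y_i = e_i . x for the given record x."""
    return [sum(e * x for e, x in zip(vector, record)) for vector in eigenvectors]


# Hypothetical orthonormal eigenvectors and a hypothetical two-variable record.
eigenvectors = [(0.6, 0.8), (-0.8, 0.6)]
record = (10.0, 5.0)
scores = principal_component_scores(record, eigenvectors)
```

With four variables X1 to X4 and four eigenvectors, the same function would return the four component values Y1 to Y4 for each record.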

5. Finding the correlation between the variables and the Principal Components.
Each Principal Component formed was correlated with all variables in accordance with (5). An example correlation calculation was done for Y1 and X1. The results were arranged in matrix form, producing a correlation matrix. By observing the matrix, the variables that were less influential for the classification process were identified. An example of the correlation matrix can be seen in Table 2. Based on this matrix, the variables chosen were those whose value was above 0.5. The age variable (X1) was therefore not used as a classification variable. The variables used for classification were weight (X2), height (X3), and body temperature (X4).
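The selection rule can be sketched as follows. The correlation formula is the standard one for components derived from a covariance matrix and is assumed to correspond to the paper's equation. The correlation values are hypothetical stand-ins for Table 2. Comparing the absolute value against 0.5 across all components is also an assumption, since the text only says "above 0.5".

```python
import math


def component_correlation(eigvec_entry, eigenvalue, variance):
    """Correlation between a variable and a component: e * sqrt(lambda) / sqrt(s)."""
    return eigvec_entry * math.sqrt(eigenvalue) / math.sqrt(variance)


def select_variables(correlations, threshold=0.5):
    """Keep variables whose strongest absolute correlation exceeds the threshold."""
    return [name for name, row in correlations.items()
            if max(abs(r) for r in row) > threshold]


# Hypothetical correlations of each variable with two components.
correlations = {
    "age": [0.31, 0.22],
    "weight": [0.91, 0.10],
    "height": [0.88, -0.35],
    "body_temperature": [0.12, 0.76],
}
selected = select_variables(correlations)
r_example = component_correlation(0.6, 4.0, 9.0)
```

With these illustrative numbers, age falls below the threshold on every component and is dropped, which mirrors the outcome of the four-variable example in the text.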
Applying the same calculation to the training data yielded 18 variables, i.e. weight, sex, cough, flu, diarrhea, fever, vomiting, inability to drink or suckle, seizures, breathing difficulty, unconsciousness, stridor, blood in the feces, hollowed eyes, poor abdominal skin turgor, fussiness/irritability, diarrhea of 14 days or more, and turbidity on the cornea. These variables were used for the classification process.

2) K-Nearest Neighbor (K-NN) Classifier
The K-NN classifier works by calculating the distance between the testing data and the training data. The steps of the K-NN classifier were as follows:
1. Data normalization.
Data normalization was carried out on the training data and the testing data. The Sex variable had two values, "Male" and "Female", which were normalized to "1" and "0" respectively. For the Symptom variables, the values "Yes" and "None" were normalized to "1" and "0" respectively. Min-max normalization was used for the weight, height, and body temperature variables. The equation used the highest and lowest values of the training data for each variable.
As an example, min-max normalization was calculated for patient "Y", who was 20 months old, weighed 10 kg, was 79 cm tall, and had a body temperature of 37.5 °C. The maximum and minimum values of the training data can be seen in Table 3.
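The normalization step can be sketched as follows. The patient values (10 kg, 79 cm, 37.5 °C) come from the example in the text, but the minimum and maximum bounds are hypothetical placeholders because the Table 3 values are not available here. The function names are also illustrative.

```python
def min_max(value, minimum, maximum):
    """Scale a value into [0, 1] using bounds taken from the training data."""
    if maximum == minimum:
        return 0.0  # constant column: no spread to scale by
    return (value - minimum) / (maximum - minimum)


def encode_sex(value):
    """'Male' -> 1, 'Female' -> 0, as described in the text."""
    return 1 if value == "Male" else 0


def encode_symptom(value):
    """'Yes' -> 1, 'None' -> 0, as described in the text."""
    return 1 if value == "Yes" else 0


# Hypothetical training bounds (min, max); the real ones are in Table 3.
bounds = {"weight": (5.0, 25.0), "height": (50.0, 110.0), "temperature": (36.0, 40.0)}
patient_y = {"weight": 10.0, "height": 79.0, "temperature": 37.5}

normalized_y = {name: min_max(patient_y[name], *bounds[name]) for name in patient_y}
```

Testing records are scaled with the training bounds as well, so a testing value outside the training range can fall slightly below 0 or above 1.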

2. Calculation of the Euclidean distance.
The similarity of a record to a class was determined by its distance. The smaller the distance, the greater the resemblance to the class. The Euclidean distance equation was used to measure this distance. The distance from each testing record to the training data was calculated.
As an example, the Euclidean distance between patient "Z" and patient "Y" was calculated by applying (7). The distance between "Z" and "Y" was 1.76339. The same formula was used to calculate the distance from patient "Z" to the other patients.
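The distance calculation can be sketched as follows. The two feature vectors are hypothetical and do not reproduce the 1.76339 result, because the normalized values of patients "Z" and "Y" are not listed in the text.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical feature vectors, for illustration only.
patient_z = [0.0, 0.0, 1.0]
patient_y = [3.0, 4.0, 1.0]
distance_zy = euclidean_distance(patient_z, patient_y)
```

Because every feature is scaled to a comparable range before this step, no single variable such as height dominates the distance.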

3. Weighted voting.
Errors in predicting classes could occur even after the distances had been calculated. This was because of noise, or data that deviated far from its original class.
Weighted voting was needed to determine the class of new data. Voting was done by weighting each class. By calculating the total value of the Euclidean distances for each class, the final result of the classification process could be determined.
The output of this study was 16 classes of early childhood diseases, i.e. cough, pneumonia, severe pneumonia, diarrhea, mild dehydration diarrhea, severe dehydration diarrhea, persistent diarrhea, severe persistent diarrhea, dysentery, fever, fever with common risk sign, measles, measles with severe complication, measles with complication, fever may be DHF, DHF, and fever isn't DHF. As an example, weighted voting was calculated for the "Cough" class. There were 45 records with a "cough" diagnosis among the 400 training data. Equation (8) was applied to get the "Cough" class score for patient "Z". The same calculation was applied to the other early childhood disease classes with their own values. The results of weighted voting for all classes are shown in Table 4. The "Cough" class had a score of 146.8976105, which was the highest among all classes. It was therefore concluded that patient "Z" suffered from "Cough".
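One possible form of this voting step is sketched below. The exact weighting function of Equation (8) is not recoverable from the text, so the inverse squared distance weight used here is an assumption, as is letting every training record of a class contribute to that class's score. The training records are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(query, training, epsilon=1e-9):
    """Sum an inverse-squared-distance weight per class and return the top class.

    The weight 1 / (d^2 + epsilon) is an assumed form, not the paper's Equation (8).
    epsilon avoids division by zero when a training record equals the query.
    """
    scores = defaultdict(float)
    for features, label in training:
        d = math.dist(query, features)  # Euclidean distance (Python 3.8+)
        scores[label] += 1.0 / (d * d + epsilon)
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Hypothetical normalized training records: (features, diagnosis).
training = [
    ([0.0, 0.0], "cough"),
    ([0.0, 1.0], "cough"),
    ([3.0, 4.0], "pneumonia"),
]
label, scores = weighted_vote([0.0, 0.5], training)
```

Closer records contribute larger weights, so a class with many near neighbors outscores a class with a few distant ones, which is the behavior the example for patient "Z" describes.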

C. System Design
The system design of the PCA and K-NN classifier for the classification of early childhood diseases was described using a System Flow Diagram. Two scenarios were applied to show that PCA can reduce features and improve classification accuracy. In the first scenario, the user can use two features of the system: entering new data and viewing the results of testing. The new data entered becomes training data. Users can see the results of testing by entering testing data. The system processes the testing data and displays the results of the classification. The System Flow Diagram for the classification of early childhood diseases using K-NN can be seen in Fig. 1. The second scenario is the same as the first, but the testing data is processed using feature reduction. The System Flow Diagram for the classification of early childhood diseases using K-NN and PCA can be seen in Fig. 2. The form that displays the classification results can be seen in Fig. 3.

D. System Implementation
The system was implemented as a web-based application. It was built using the Hypertext Preprocessor (PHP) programming language, Hypertext Markup Language (HTML), JavaScript, and MySQL as the database management system. The tools used to build the system were Adobe Dreamweaver, Brackets, and XAMPP.

E. System Testing
Black box testing was done to test how well the application performed. It was done by comparing the expected results with the results issued by the system. Testing was carried out by applying PCA and the K-NN classifier to the classification of early childhood diseases. The form showing the K-NN classifier result can be seen in Fig. 3.
Fig. 4 The result of system evaluation

F. System Evaluation
System evaluation was carried out on the two systems built, i.e. the classification system using PCA and the K-NN classifier, and the classification system using the K-NN classifier alone. The 500 records obtained were processed by both systems; 350 were used as training data and 150 were used as testing data. Based on the results of system testing, 14 data were labelled as true for classification using PCA and K-NN, while 18 data were labelled as true for classification using K-NN without PCA. Accuracy was calculated by dividing the number of correct data by the total number of data, then multiplying by 100%. The result was correct if the expected result and the result issued by the system were the same.

In (5), $r_{X,Y}$ is the correlation between sample X and principal component Y, $e$ is the eigenvector, and $s$ is the covariance matrix.

Example covariance calculation for cov(X1, X2):
$$\operatorname{cov}(X_1, X_2) = \frac{(20 - 23.41)(10 - 10.89) + \cdots + (12 - 23.41)(9 - 10.89)}{500 - 1} = 42.334$$

2) K-Nearest Neighbor (K-NN) Classifier
The K-NN classifier was used to determine the classification of early childhood diseases. The variables used for classification were the variables obtained from PCA. The K-NN classifier used weighted voting to calculate the weight of each class. The steps of the K-NN classifier are those listed under Data analysis.

Fig. 1 System flow diagram for classification using K-NN

Fig. 3 The form that displays the classification results

TABLE 4 WEIGHTED
Fig. 2 System flow diagram for classification using PCA and K-NN