Benchmarking Naïve Bayes and ID3 Algorithm for Prediction Student Scholarship

The student scholarship delivered through Indonesian smart cards (ISC) is a cash scholarship for all students aged 6-21 years, intended to support destitute children and potential dropouts. Large volumes of student data are currently stored in educational databases, and these data are used to determine eligible ISC recipients. Manual prediction by humans takes a long time and is prone to human error. This paper aims to predict eligible ISC recipients using educational data mining. The data come from a senior high school database in Riau Province, Indonesia. We compare two algorithms, Naïve Bayes and ID3, for predicting ISC recipients. Eighty percent of the work in this paper is data pre-processing. Information about other scholarships is the most influential variable for predicting student scholarship. ID3 classification achieves an accuracy of 83 percent and an F1 score of 71 percent, while Naïve Bayes classification achieves an accuracy of 89 percent and an F1 score of 72 percent. In this case, the Naïve Bayes algorithm performs better than the ID3 algorithm.


Introduction
The Smart Indonesia Program, delivered through Indonesian smart cards (ISC), is a scholarship from the Indonesian government for all students aged 6-21 years who come from destitute families or are potential dropouts. Schools, through their teachers, propose eligible candidates to the government, and in response the government provides the scholarship in the form of cash. If a school has a large number of students, teachers must classify the candidates manually. This is not ideal, because the process takes a long time and the classification may be subjective. As a result, the scholarship may not reach the intended beneficiaries.
One solution to this problem is educational data mining, which applies data mining tasks to educational data. There are four data mining tasks: classification, clustering, association rules, and prediction [1]. The most powerful task for predicting a class is classification.
The Naïve Bayes classifier, being a technique based on probabilities, is applicable to a wide range of domains according to the literature surveyed in this paper. It is computationally fast, has linear complexity, and does not require a large amount of training data. Furthermore, the algorithm justifies the membership of an instance in a particular class with the help of probabilities. The main objective of this paper is to use data mining methodologies to determine which children should receive the ISC. Data mining provides many tasks that can be used to study the student scholarship parameters. Several attributes determine whether a child receives the ISC: the income of the parents (father and mother), transportation, jobs, previous scholarship, and room and board. In this research, two classification algorithms are used to predict student scholarship: the decision tree (ID3) and Naïve Bayes. The algorithms are evaluated using accuracy, precision, recall, and the F1 measure.
IOP Publishing doi:10.1088/1757-899X/1232/1/012002

Related Work
Data mining has been applied to many real-world problems, such as business intelligence, decision making, and education [2]. In the education sector, several researchers have implemented data mining techniques to solve problems or find patterns in educational data. For example, Jalota and Agrawal (2019) used the Kalboard 360 data set, which lies in the education domain and was gathered using a learning management system (LMS), to predict the performance level of students [3] with WEKA. They used the J48, SVM, Random Forest, Multilayer Perceptron, and Naïve Bayes algorithms. The results show that the Multilayer Perceptron had the best performance on that data set.
Earlier research on predicting student performance was proposed by Devasia, Vinushree, and Hegde in 2016, who used the Naïve Bayes algorithm to predict potential dropouts among students in higher education [4]. Along the same lines, Ahmed and Elarab published their research in 2014. They applied the ID3 algorithm to a data set taken from the student database of an educational institution, sampling the Information System department for the sessions from 2005 to 2010 [5].
Classification algorithms have also been used to analyse students' behavior. Abu Tair and El-Halees analysed educational data to figure out the reasons behind students' behaviors and to decide on solutions and treatment paths [6]. They mapped student behavior into 12 behaviors.
In addition, many researchers have implemented the Naïve Bayes algorithm to predict real-world cases, such as predicting purchases [7] and predicting slow learners in the education sector [8].
The Naïve Bayes algorithm has been applied to educational data mining in different contexts. For example, the Naïve Bayes classifier is suitable for effective classification of student feedback for faculty on classroom delivery, because a classroom yields a small data set [9].

Design Process
In general, the data mining process starts with the available data and proceeds through feature selection, data pre-processing, data mining, and evaluation, ending with the knowledge extracted from the data, as shown in figure 1.

Data Pre-processing
In this research, the data come from ISC recipient records at a senior high school in Riau Province, Indonesia. Initially, the data set contains 882 records, seven attributes, and two classes (table 1): 778 records belong to class not and 104 to class yes.
Data pre-processing is a critical step in the data mining process. In this step, only the fields required for data mining were selected. A few derived variables were selected using feature selection, the attributes were transformed into categorical types, and missing values were handled. The data were then split into training and testing sets using k-fold cross-validation with k = 5.
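The 5-fold split described above can be sketched in plain Python. This is a minimal illustration and not the authors' code: the function name `k_fold_indices`, the shuffling, and the fixed seed are assumptions, since the paper states only that k = 5.

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    # Shuffling with a fixed seed is an assumption made for reproducibility.
    random.Random(seed).shuffle(indices)
    # Distribute the records as evenly as possible over the k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 882 records, matching the size of the data set described above.
splits = list(k_fold_indices(882, k=5))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training four times.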

Classification Method
There are many classification algorithms, each with advantages and disadvantages. No single algorithm is powerful for every kind of data; for example, many researchers use decision trees for categorical data. A decision tree is a flow-chart-like tree structure with a root, internal nodes, and leaves. The root and internal nodes are denoted by rectangles and represent attributes, while the leaves are denoted by circles and represent the decision or class. All internal nodes have two or more child nodes and contain splits, which test the value of an expression of the attributes [5].

Iterative Dichotomiser 3 (ID3):
ID3 is an existing decision tree algorithm based on Shannon's information theory. The basic idea of the ID3 algorithm is to construct the decision tree by a top-down, greedy search through the given sets, testing each attribute at every tree node. To select the attribute that is most useful for classifying a given set, we introduce a metric, information gain. To find an optimal way to classify a learning set, we need a function that provides the most balanced splitting. The information gain metric is such a function. Given a data table that contains
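The information gain metric mentioned above can be sketched as follows, using the standard definitions: Shannon entropy of the class labels, minus the size-weighted entropy of the partitions produced by an attribute. This is an illustrative sketch, not the authors' implementation; the attribute names and the toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets created by the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy records; names and values are for illustration only.
rows = [
    {"other_scholarship": "yes", "transportation": "walk"},
    {"other_scholarship": "yes", "transportation": "bus"},
    {"other_scholarship": "no", "transportation": "walk"},
    {"other_scholarship": "no", "transportation": "bus"},
]
labels = ["yes", "yes", "not", "not"]

# At each node, ID3 greedily picks the attribute with the largest gain.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

In this toy table the hypothetical `other_scholarship` attribute separates the classes perfectly, so it has the largest gain and would be chosen as the root split.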

Naive Bayes:
The Naïve Bayes algorithm is a simple probabilistic classifier that computes a set of probabilities by counting the frequencies of value combinations in the given data set [7]. It is a classifier based on Bayes' theorem. It predicts membership probabilities for each class, that is, the probability that a given record or data point belongs to a particular class (equation 3). The class with the highest probability is considered the most likely class. The Naïve Bayes classifier assumes that all features are unrelated to each other, so that the effect of each feature on the class can be studied individually: the presence or absence of one feature does not influence the presence or absence of any other feature. Although the features may in fact depend on each other, they are all treated as contributing independently to the probability that a record belongs to a class. The hypothesis is tested given multiple pieces of evidence (features).
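A minimal sketch of a categorical Naïve Bayes classifier, following the description above, is given here. It is not the authors' implementation: the function names, the toy records, and the Laplace smoothing parameter `alpha` are assumptions, since the paper does not say whether smoothing was used.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(attribute, label)][value] += 1
            values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row, alpha=1.0):
    """Return the class with the highest (unnormalised) posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in row.items():
            # Laplace-smoothed P(value | class); smoothing is an assumption.
            numerator = value_counts[(attribute, label)][value] + alpha
            denominator = count + alpha * len(values_seen[attribute])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy records; names and values are for illustration only.
rows = [
    {"other_scholarship": "yes", "parent_job": "farmer"},
    {"other_scholarship": "yes", "parent_job": "laborer"},
    {"other_scholarship": "no", "parent_job": "trader"},
    {"other_scholarship": "no", "parent_job": "farmer"},
    {"other_scholarship": "no", "parent_job": "trader"},
]
labels = ["yes", "yes", "not", "not", "not"]

model = train_naive_bayes(rows, labels)
prediction = predict(model, {"other_scholarship": "yes", "parent_job": "laborer"})
```

The per-attribute probabilities are simply multiplied together, which is where the independence assumption enters.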

Evaluation Method:
The most important question is how well the model solves the problem. To evaluate the performance of the ID3 and Naïve Bayes algorithms, we used accuracy, precision, recall, and the F1 measure [3].
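The four measures named above can be computed from the counts of true and false positives and negatives, using their standard definitions. The sketch below is illustrative only: the function name `evaluate` and the toy labels are hypothetical, and class yes is assumed to be the positive class.

```python
def evaluate(y_true, y_pred, positive="yes"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels, for illustration only.
y_true = ["yes", "yes", "not", "not", "not", "yes"]
y_pred = ["yes", "not", "not", "not", "yes", "yes"]
scores = evaluate(y_true, y_pred)
```

Calling `evaluate` once with `positive="yes"` and once with `positive="not"` gives the per-class figures of the kind reported for each algorithm.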

Result and Discussion
In the pre-processing step, we removed typos and noise symbols from the data and analysed the variables in the education database using correlation values (Table 2). Only the domicile variable has a linear relationship with the class; the other variables do not. Information about other scholarships has the greatest influence among the variables in determining ISC recipients.
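A correlation value between an encoded attribute and the class can be computed as sketched below. The paper does not state which correlation coefficient was used, so Pearson's coefficient on 0/1-encoded values is an assumption here, and the data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 0/1 encodings: 1 = has another scholarship / receives ISC.
other_scholarship = [1, 1, 0, 0, 1, 0]
receives_isc = [1, 1, 0, 0, 0, 0]
r = pearson(other_scholarship, receives_isc)
```

Values of `r` near 1 or -1 indicate a strong linear relationship with the class, and values near 0 indicate little or none.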

Classification using ID3 Algorithm
In this part, the data were classified with the ID3 algorithm implemented in Python. Overall, the ID3 algorithm predicts eligible scholarship recipients with an accuracy of 83 percent. The detailed performance of the ID3 algorithm is shown in Table 3. The algorithm performs well on class not, as shown by its high precision, recall, and F1 measure for that class.

Classification using Naïve Bayes
In the next experiment, the data were classified with the Naïve Bayes algorithm implemented in Python. The Naïve Bayes algorithm predicts eligible ISC recipients with an accuracy of 89 percent. The performance of the algorithm is shown in table 4; it is not markedly different from that of the ID3 algorithm. Classification using Naïve Bayes has high precision, recall, and F1 for predicting the class. High precision shows that the algorithm correctly predicts the positive class for the ISC recipient data. In addition, the classification model is able to recover the actual class. However, we observe that the class imbalance in the data has not been handled well. As a result, precision, recall, and F1 are higher for class not than for class yes.

Comparison of both algorithms
Based on the two experiments, both algorithms are able to predict eligible ISC recipients. Although their performance is similar, Naïve Bayes scores higher than ID3 on all four performance measures (table 5). In other words, the Naïve Bayes algorithm is better than ID3 for classifying ISC recipients.

Conclusion
In this paper, seven variables are used in the classification process, and the most influential one for predicting eligible ISC recipients is information about other scholarships. Both the Naïve Bayes and ID3 algorithms are able to predict ISC recipients from data drawn from an educational database. Naïve Bayes is better than ID3 at predicting ISC recipients. This study will help schools determine ISC recipients quickly. Our next hypothesis is that handling the imbalanced data will increase the performance of both algorithms.