Implementation of decision tree using C5.0 algorithm in preference and electability survey results on regional head election in Aceh

The decision tree is one of the classification methods in data mining. Many algorithms are used to construct the tree model; one of them is the C5.0 algorithm. In this study, a tree model with the C5.0 algorithm was built from the results of a pre-campaign survey on the preference and electability of regional head candidates in 2018 in one of the districts in Aceh. The dataset consisted of 5 predictor variables, i.e. sub-district, age, main occupation, highest education, and the attracting factors of the regional head candidates. The decision variable categories were candidates A, B, C, and D. The dataset was divided into training data and testing data using the k-fold cross-validation method. The optimum tree model was selected based on the accuracy of the model and the Kappa coefficient. The results showed that the best tree model was obtained when the testing data was placed in the 10th data set (S10). The accuracy of the model and the Kappa coefficient were 0.8427 and 0.7208, respectively. Three rules were generated, with five nodes. The main predictor variables contributing to the optimum model were the attracting factors of the candidates and the sub-district.


Introduction
Classification is a data mining method that predicts label variables based on criteria variables [1]. Decision trees, artificial neural networks, Bayesian classifiers, and Bayesian networks are some of the methods for data classification [2]. The decision tree is the most popular method and is easy to apply, and its classification results are easy to interpret [3][4].
The results of a decision tree classification are presented in the form of a tree diagram [5], grouping each object based on certain characteristics. These characteristics are derived from the attributes of the decision maker and may be categorical data, numerical data, or both [6].
The decision tree has evolved through the development of its algorithms. Many algorithms have been developed, such as the ID3 (iterative dichotomiser 3), C4.5, and C5.0 algorithms. The basis of all decision tree algorithms is the ID3 algorithm; the C4.5 algorithm is a development of ID3, and the C5.0 algorithm is a development of C4.5 [7].
The C5.0 algorithm is widely used today. One reason for using it is that it copes well with large data and its analysis process is faster [8]. This algorithm is also able to remove attributes that do not have a major influence on the classification classes [9].
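To illustrate the splitting criterion shared by the C4.5 and C5.0 family, the sketch below computes the information gain ratio of a categorical attribute. This is a minimal, self-contained illustration of the general technique, not the paper's implementation; the function names and toy data are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Information gain ratio for splitting `labels` by the
    categorical attribute values (one value per record)."""
    total = len(labels)
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy remaining after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split info penalizes many-valued attributes (the C4.5/C5.0 idea)
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: the attribute separates the two classes perfectly
attr = ["K1", "K1", "K2", "K2"]
labels = ["A", "A", "B", "B"]
print(gain_ratio(attr, labels))  # -> 1.0
```

At each node the algorithm selects the attribute with the highest gain ratio; the penalty term `split_info` is what distinguishes C4.5/C5.0 from ID3's plain information gain.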
The algorithm can be applied to any type of data [7]. One suitable dataset is the survey data on the preference and electability of regional head candidates in one of the regencies/cities in Aceh Province. This study aims to find a decision tree model and its rules. Using this method, the characteristics of each pair of regional head candidates can be determined. These characteristics take the form of voter demographic information and the strengths of each candidate pair [3][9][10].

Methods
This research uses the results of a survey on the preference and electability of regional head candidates in one regency/city in Aceh Province in 2018. There are 4 candidate pairs for regional head (A, B, C, and D) into which respondents are classified. The characteristics describing the candidate pairs are the voter's sub-district (X1), the voter's age (X2), the voter's highest education (X3), the voter's main occupation (X4), and the attracting factors of the candidate pair (X5). Data were taken from 890 respondents from all villages in the regency/city [10].
The sampling in this study used the stratified random sampling method, with the village as the stratum. To determine the sample size for each village, a proportional allocation method was used, following the equation n_i = (N_i / N) * n, where n_i is the sample size for the i-th village, N_i is the number on the permanent voter list (DPT) in the i-th village, N is the total DPT for all villages, and n is the total number of samples. The initial sample size was 1000 respondents; after data cleaning, 890 respondents remained and could be processed.
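The proportional allocation described above can be sketched as follows. The village names and DPT counts are hypothetical, for illustration only.

```python
def proportional_allocation(dpt_per_village, total_sample):
    """Allocate total_sample across villages proportionally to each
    village's permanent voter list (DPT): n_i = (N_i / N) * n."""
    total_dpt = sum(dpt_per_village.values())
    return {village: round(dpt / total_dpt * total_sample)
            for village, dpt in dpt_per_village.items()}

# Hypothetical DPT counts for three villages, total sample n = 1000
dpt = {"Village1": 5000, "Village2": 3000, "Village3": 2000}
print(proportional_allocation(dpt, 1000))
# -> {'Village1': 500, 'Village2': 300, 'Village3': 200}
```

In practice the rounded allocations may not sum exactly to n; any remainder is typically distributed to the largest strata.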

Results and Discussion
The regional head election in one regency/city in Aceh Province, whose data were taken for this study, was contested by 4 candidate pairs. Each candidate pair has certain characteristics based on voter demographics and the attracting factors of the pair. The mode for each attribute describing the candidate pairs is shown in Table 1. The dominant respondents came from the K2 sub-district (19.10%), were aged 31-40 years (36.29%), had a high school/vocational educational background (45.17%), and were dominated by farmers (33.26%). The majority of respondents chose a regional head candidate because the candidate was close to the people (often interacts with them).
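The per-attribute modes summarized in Table 1 can be computed directly from the survey records. The sketch below uses pandas on a small hypothetical sample; the column names and values are illustrative, not the paper's actual data.

```python
import pandas as pd

# Hypothetical survey records with four of the predictor attributes
df = pd.DataFrame({
    "sub_district": ["K2", "K2", "K1", "K3", "K2"],
    "age_group":    ["31-40", "31-40", "21-30", "31-40", "41-50"],
    "education":    ["HS", "HS", "Bachelor", "HS", "HS"],
    "occupation":   ["farmer", "farmer", "trader", "farmer", "civil servant"],
})

# Mode (most frequent category) of each attribute, as in Table 1
modes = df.mode().iloc[0]
print(modes)

# Relative frequency of the dominant category per attribute
shares = {col: df[col].value_counts(normalize=True).iloc[0] for col in df}
print(shares)
```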
Data partitioning is done as a first step, dividing the data into training data and testing data. This partition uses the k-fold cross-validation method, where the number of groups (k) in this study is 10 [9]. Each data set is analyzed using the C5.0 decision tree, and the results are compared for each placement of the testing data in data sets 1 through 10 (S1 to S10), using the accuracy value and the Kappa coefficient. The results of the analysis are shown in Table 2.

The comparison of the C5.0 decision trees based on the testing data placement gives the highest accuracy value, 0.8427, when the testing data is in the 10th data set (S10); the same holds for the Kappa coefficient, which is 0.7208. Further analysis was therefore done using the 10th set as testing data, with the other data sets used as training data. The decision tree generated with the 10th set as testing data is presented in Figure 1. This decision tree model produces 3 rules, with classes A and B dominating as the decision classes. The resulting rules are shown in Table 3.
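The fold-by-fold comparison can be sketched as below. Since no canonical C5.0 implementation exists in Python (the paper most likely used the C50 package in R), scikit-learn's CART-based DecisionTreeClassifier stands in for it here, and the data are random integer-encoded stand-ins with the same shape as the survey (890 respondents, 5 predictors, 4 classes); all names and numbers here are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
# Stand-in data: 890 respondents, 5 integer-encoded categorical
# predictors, 4 candidate classes A-D encoded as 0-3
X = rng.integers(0, 5, size=(890, 5))
y = rng.integers(0, 4, size=890)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
results = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # CART stands in for C5.0 in this sketch
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    results.append((fold,
                    accuracy_score(y[test_idx], pred),
                    cohen_kappa_score(y[test_idx], pred)))

# Pick the testing-data placement (S1..S10) with the best accuracy
best = max(results, key=lambda r: r[1])
print(f"best fold S{best[0]}: accuracy={best[1]:.4f}, kappa={best[2]:.4f}")
```

With the paper's real data, the fold whose held-out placement yields accuracy 0.8427 and Kappa 0.7208 (S10) would be selected by this same comparison.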

Conclusion
The main predictor attributes in the optimum model are the attracting factors of the candidates and the sub-district of origin of the voters. The best decision tree model is the one generated with the testing data in the 10th data set, since this placement gives the highest model accuracy and Kappa coefficient, 0.8427 and 0.7208, respectively. The model produces 3 rules with 5 nodes.