Application of a combination between Principal Component Analysis and Logistic Regression Based on Support Vector Machine on Educational Data Mining with Overlapping Data Problem

In 2019, the government of the Republic of Indonesia issued a zoning-based policy for New Student Admissions (PPDB) from elementary school (SD) to high school (SMA), especially for public schools. The policy is documented in Permendikbud No.51 / 2018. It aims to ensure equality of education and to keep potential students from focusing only on favorite schools. However, the policy raises new problems. One of them is that a potential student with a medium UN (National Examination) score, or whose home is a medium distance from the destination school, has a very small chance of being accepted there. The situation is worse if potential students do not know the lowest score and the farthest distance the destination school can accept. They then choose schools by guessing rather than from valid data, so their chances of being accepted are very small. This research focused on Educational Data Mining of the PPDB of public high schools (SMA) in Jombang in the academic year 2019/2020. It aims to help potential students predict their destination school from their own grades and home distance using data mining classification techniques. However, another problem emerged in this study: overlapping data, where the same value is shared by more than one class. For example, a potential student of SMA Negeri 2 Jombang (SMAN 2 Jombang) scored 80 in Bahasa Indonesia, the same as a student from SMA Negeri 3 Jombang (SMAN 3 Jombang). The overlap affects not just one record but almost all of the data. The study used 600 records: 308 from PPDB 2019 of SMAN 2 Jombang and the rest from SMAN 3 Jombang. The attributes were the home distance from the destination school, the overall UN score, and the UN scores for Mathematics, Natural Sciences, Bahasa Indonesia, and English.
The algorithm used was a combination of Principal Component Analysis (PCA) with Logistic Regression (LR) based on Support Vector Machine (SVM) with the ANOVA kernel. Validation used 10-fold cross-validation, and algorithm performance was evaluated by accuracy, precision, and recall. The proposed combination achieved an accuracy of 94.33%, a precision of 96.28%, and a recall of 92.53%. These results were better than those obtained without PCA (70.83% accuracy, 69.62% precision, and 76.62% recall). With PCA, the data could be viewed from another angle that separates one class from the other. Even though 100% of the data overlapped, no record was exactly the same across all attributes.


Introduction
In 2019, the government of the Republic of Indonesia issued a zoning-based policy for the New Student Admission (PPDB) process from elementary school (SD) to senior high school (SMA), especially for public schools. The policy is documented in Permendikbud No.51 / 2018 [1]. It was made to equalize the quality of education in Indonesia and to end the notion of favorite schools, which commonly relied heavily on students' top scores in the admission process. The policy raises new problems for potential students whose scores are moderate and whose homes are not very close to the destination school, especially in densely populated areas such as cities. Their chance of being accepted at the destination school is very small. Conversely, potential students whose homes are close to public schools tend to underestimate the importance of the National Examination (UN) result. One complaint came from a parent, Albert Mercelino, whose son registered at SMA Negeri 1 Kuta Utara: "Keep using general zoning, not a radius. Use the score ranking too. That's fair. The zoning policy is just like we teach our children not to struggle. The fools will relax. Learning or not, the most important thing is close to (school)" [2]. A second complaint came from Fitri Suhermin, the parent of a potential student of SMP Negeri 8 Surabaya, who, as reported by the Indopos daily, was disappointed that her child could not be accepted at SMP Negeri 8, only 700 meters from home, because of the zoning system [3]. This research was conducted to predict which potential students were likely to be accepted at public senior high schools in Jombang Regency, East Java Province, Indonesia, by implementing Educational Data Mining with classification techniques.
The general goal of this research was to support the government policy and to help provide recommendations to potential students in choosing their schools. However, overlapping data problems occurred in this research: the data overlap one another across classes (see the values printed in red in Table 1), namely the data of students accepted at SMA Negeri 2 Jombang (SMADAJO) and those accepted at SMA Negeri 3 Jombang (SMAGAJO). The overlap is not limited to one attribute of each record but affects all attributes of all records. Any algorithm applied to overlapping data will not produce a good classification [4]. One solution is to remove dimensions or attributes. If they are not removed, the existing data may not be able to distinguish among classes, because exactly the same data exist in both classes [4]. According to Gu and Cheng, overlapping data problems are one of the obstacles in data mining and pattern recognition because they directly affect the accuracy of the classification and the ability to generalize [5]. The specific purpose of this research is to find an algorithm that is suitable for overcoming the overlapping data problems. The solution follows Suyanto's statement [4]: "To well understand a problem, look at it from another perspective". For example, viewed from a certain angle, the data would look normal and would not overlap.
To realize the objectives of this research, several algorithms were chosen according to the characteristics of the research data. One of the algorithms commonly used to overcome an overlapping data problem is Principal Component Analysis (PCA) [4] [6] [7]. According to Suyanto [4], "PCA is a mathematical method that transforms data into a new domain which results in many more important principal components". To classify the overlapping data, the algorithm chosen was Logistic Regression [8] based on Support Vector Machine (SVM), which refers to myKLR [9], because it has been proven to have good resistance [10]. The kernel used in both PCA and SVM is the ANOVA kernel because, according to Stitson et al. [11], ANOVA has proven to have good multi-dimensional performance.

Proposed method
The proposed method involved collecting the dataset, dimensionality reduction, classification, validation, and evaluating the performance of the classification algorithms. The proposed method is illustrated in Figure 1. The data in this study were obtained directly from the PPDB of the Government of East Java Province, Indonesia, in the academic year 2019/2020. The data sources were two public high schools in Jombang, SMADAJO and SMAGAJO, which were used as the two classes in this study. There were 600 records: 308 students of SMADAJO, and the rest were those accepted at SMAGAJO. Six attributes were used: the home distance from the chosen school, the total score of the National Exam (UN), and the UN scores for Mathematics, Natural Sciences, English, and Bahasa Indonesia. All attributes of all records were affected by overlapping, but no record was exactly the same on every attribute in both the SMADAJO and SMAGAJO classes. The data could only be accessed while PPDB registration was open. The East Java Provincial PPDB system could be accessed at https://ppdbjatim.net. All data were numeric, both integer and real. The dataset is illustrated in Table 1.
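The distinction drawn above, between values that overlap attribute by attribute and records that are identical on every attribute, can be checked with a few lines of code. The following is only a minimal sketch: the function names and the two-attribute toy records (home distance in km, Bahasa Indonesia score) are hypothetical and are not taken from the PPDB dataset.

```python
def attribute_overlap(class_a, class_b):
    """For each attribute, return the share of class_a records whose
    value on that attribute also occurs somewhere in class_b."""
    n_attributes = len(class_a[0])
    shares = []
    for j in range(n_attributes):
        values_in_b = {record[j] for record in class_b}
        hits = sum(1 for record in class_a if record[j] in values_in_b)
        shares.append(hits / len(class_a))
    return shares


def identical_records(class_a, class_b):
    """Return the records that are the same on every attribute in both classes."""
    return set(class_a) & set(class_b)


# Hypothetical toy records: (home distance in km, Bahasa Indonesia score).
school_a = [(1.2, 80.0), (0.8, 85.0), (2.0, 90.0)]
school_b = [(2.0, 80.0), (1.2, 85.0), (0.8, 90.0)]

overlap_shares = attribute_overlap(school_a, school_b)
shared_records = identical_records(school_a, school_b)
print(overlap_shares)   # share of overlapping values per attribute
print(shared_records)   # records identical on all attributes
```

In this toy case every single attribute value of one class also appears in the other class, yet no complete record is shared, which is the situation the study describes for its own data.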

Dimensionality reduction
Overlapping data is one of the common problems in data mining and pattern recognition [5]. Adopting Suyanto [4], data with overlapping problems are illustrated in Table 2 as five objects with three attributes or dimensions (length, width, and height in meters) grouped into two classes (Tables and Chairs). Table 3 shows five objects with two attributes or dimensions (length and width in meters) grouped into the same two classes; the height dimension was removed because it could not distinguish between the classes. In Table 2 and Table 3 the overlapping values are printed in red. Figure 2 visualizes the five objects in a two-dimensional space (length and width), where the position of a circle is close to two triangles. Figure 3 visualizes the five objects in the single dimension of width (the length dimension was removed); two objects overlap because both have the same width. Figure 4 visualizes the five objects in the single dimension of length (the width dimension was removed); two objects overlap because they have the same length of 2.1. Figure 5 visualizes the data from another perspective: when the head is tilted to the right or left, do the data look randomly scattered or rather normal? Such a view can be produced by PCA, where the first principal component (PC1) and the second principal component (PC2) are the new dimensions in the PCA domain. The five objects can then be separated into two classes (triangles and circles) by one dividing line (dashed).
Figure 5. The visualization of the data could be handled by applying PCA [4]
In this study, all selected algorithms were run in RapidMiner with the ANOVA kernel, a gamma value of 1, and a degree of 3, which are the default values in RapidMiner. The same ANOVA kernel was also used in the SVM.
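The "other perspective" idea can be made concrete with a small sketch. It assumes plain linear PCA on two-dimensional points, not the kernel-based PCA run in RapidMiner for this study, and the five toy points are hypothetical (they are not the values of Table 2 or Table 3). They are chosen so that the two classes share values on both original axes but separate along one principal component.

```python
import math


def pca_2d(points):
    """Linear PCA for 2-D points; returns eigenvalues, unit axes, and projected scores."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix (largest first).
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = (half_trace + spread, half_trace - spread)

    # The eigenvector of the larger eigenvalue is PC1; PC2 is perpendicular to it.
    if sxy == 0:
        pc1 = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    else:
        vx, vy = sxy, eigenvalues[0] - sxx
        norm = math.hypot(vx, vy)
        pc1 = (vx / norm, vy / norm)
    pc2 = (-pc1[1], pc1[0])

    scores = [(cx * pc1[0] + cy * pc1[1], cx * pc2[0] + cy * pc2[1])
              for cx, cy in centered]
    return eigenvalues, (pc1, pc2), scores


# Hypothetical toy objects (length, width): the classes share lengths 2.0 and 3.0
# and width 2.0, so neither original axis alone separates them.
triangles = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
circles = [(2.0, 1.0), (3.0, 2.0)]

eigenvalues, axes, scores = pca_2d(triangles + circles)
triangle_pc2 = [score[1] for score in scores[:3]]
circle_pc2 = [score[1] for score in scores[3:]]
print(triangle_pc2, circle_pc2)
```

In this toy case the separating direction happens to be PC2: all triangle scores fall on one side of all circle scores, so a single threshold on that axis plays the role of the dashed dividing line. Which component separates the classes depends on the data.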

Classification
Three classification algorithms were compared: Logistic Regression, Logistic Regression based on Support Vector Machine (SVM) (known in RapidMiner as Logistic Regression (SVM), referring to myKLR by Rueping [12]), and PCA with Logistic Regression (SVM). The three algorithms were chosen to confirm the expectation of this research that the combination of PCA, Logistic Regression (SVM), and the ANOVA kernel would give better results than the other algorithms. All algorithms in this study were run in RapidMiner from the beginning to the end of the process.
Figure 6. Illustration of the stratified 10-fold cross validation [13]
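The text names the ANOVA kernel without writing out its formula. The sketch below assumes one commonly cited form, k(x, y) = (sum over attributes of exp(-gamma * (x_i - y_i)^2)) raised to the power degree, with the gamma of 1 and degree of 3 reported for this study. Whether RapidMiner computes exactly this expression is an assumption here, and the two six-attribute records are hypothetical.

```python
import math


def anova_kernel(x, y, gamma=1.0, degree=3):
    """ANOVA kernel in the assumed form (sum_i exp(-gamma * (x_i - y_i)^2)) ** degree."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    total = sum(math.exp(-gamma * (xi - yi) ** 2) for xi, yi in zip(x, y))
    return total ** degree


def kernel_matrix(records, gamma=1.0, degree=3):
    """Pairwise kernel values; a kernel-based PCA or SVM works on this matrix."""
    return [[anova_kernel(a, b, gamma, degree) for b in records] for a in records]


# Hypothetical records with the six attributes of the study:
# distance, total UN score, Mathematics, Natural Sciences, English, Bahasa Indonesia.
student_a = (1.2, 320.0, 75.0, 80.0, 82.5, 82.5)
student_b = (1.2, 310.0, 70.0, 80.0, 80.0, 80.0)

gram = kernel_matrix([student_a, student_b])
print(gram)
```

Under this form, every attribute on which two records agree contributes exactly 1 to the sum, so records that share many attribute values still receive a high similarity, and a record compared with itself gives the number of attributes raised to the degree.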

Validation
Validation in this study used the RapidMiner default, namely 10-fold cross-validation with stratified random sampling. Stratified random sampling was chosen because it spreads the classes across the training and testing sets more evenly than other random sampling schemes. Stratified 10-fold cross-validation is illustrated in Table 4, which refers to Wahono [13].
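How stratification keeps the class proportions in every fold can be sketched as follows. This is an illustrative implementation under simple assumptions (shuffle each class, then deal the records to the folds in turn); it is not RapidMiner's own sampling code, and the function names and seed are hypothetical. Only the class sizes, 308 and 292, come from the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign record indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0  # running counter so total fold sizes stay balanced across classes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield train, test


# Class sizes from the study: 308 SMADAJO records and 292 SMAGAJO records.
labels = ["SMADAJO"] * 308 + ["SMAGAJO"] * 292
folds = stratified_folds(labels, k=10)

for fold_number, fold in enumerate(folds, start=1):
    n_smadajo = sum(1 for index in fold if labels[index] == "SMADAJO")
    print(fold_number, len(fold), n_smadajo, len(fold) - n_smadajo)
```

With these class sizes each fold holds 60 records, 30 or 31 of them from SMADAJO, so every test fold has nearly the same class mix as the full dataset.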

Evaluation
The algorithms were evaluated with a confusion matrix, from which accuracy, precision, and recall can be derived. These three measures were chosen because accuracy alone is not enough; further measures were needed to confirm that the performance of the algorithm in this study was really good. The difference between accuracy, precision, and recall is illustrated in Figure 7, which refers to Data [14]. The confusion matrix is illustrated in Figure 8 [15]. In True Positive (TP), a datum is predicted to be positive and is actually positive. In False Positive (FP), a datum is predicted to be positive but is actually negative. In True Negative (TN), a datum is predicted to be negative and is actually negative. In False Negative (FN), a datum is predicted to be negative but is actually positive. The formulas for accuracy, precision, and recall are shown in Table 5.
Figure 7. Illustration of the difference between accuracy, precision, and recall according to Data [14]
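The three measures follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions: accuracy = (TP + TN) / all data, precision = TP / (TP + FP), and recall = TP / (TP + FN). The counts passed in are hypothetical and are not results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for illustration only.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

Precision falls when many negative data are wrongly predicted positive, and recall falls when many positive data are missed, which is why the two are reported alongside accuracy.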

Experimental Results
Three experiments were carried out in this study, using Logistic Regression, Logistic Regression (SVM), and PCA with Logistic Regression (SVM). The results are presented as confusion matrices: Figure 8 presents the results of Logistic Regression, Figure 9 the results of Logistic Regression (SVM), and Figure 10 the results of the combination of PCA with Logistic Regression (SVM). More complete results are presented in Table 6. Figure 8, Figure 9, Figure 10, and Table 6 show that the combination of PCA and Logistic Regression (SVM) with the ANOVA kernel had better performance, as printed in blue, whereas Logistic Regression had very poor performance, as printed in red. Therefore, it can be concluded that overlapping data problems can be overcome by PCA with Logistic Regression (SVM) with the ANOVA kernel when predicting which potential students are likely to be accepted at selected high schools. Further research should examine whether multiclass cases can be solved by the same method, or whether the proposed method is only suitable for binary classification. This matters because there are more than two public high schools in Jombang, and potential students can choose schools other than SMADAJO and SMAGAJO.
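As a consistency check on the reported percentages, one can search for whole-number confusion-matrix counts that reproduce them. The sketch below rests on two assumptions that the text does not state: that SMADAJO (308 records) is the positive class, and that the percentages are computed from counts pooled over the ten folds. The resulting counts are a reconstruction, not values read from Figures 8 to 10 or Table 6.

```python
def consistent_counts(n_pos, n_neg, accuracy, precision, recall, tol=0.005):
    """Find (TP, FP, TN, FN) whose percentages match the reported ones within tol."""
    matches = []
    for tp in range(n_pos + 1):
        fn = n_pos - tp
        for fp in range(n_neg + 1):
            tn = n_neg - fp
            if tp + fp == 0:
                continue  # precision is undefined when nothing is predicted positive
            acc = 100 * (tp + tn) / (n_pos + n_neg)
            prec = 100 * tp / (tp + fp)
            rec = 100 * tp / n_pos
            if (abs(acc - accuracy) < tol and abs(prec - precision) < tol
                    and abs(rec - recall) < tol):
                matches.append((tp, fp, tn, fn))
    return matches


# Reported percentages with PCA and without PCA; class sizes 308 and 292 from the study.
with_pca = consistent_counts(308, 292, accuracy=94.33, precision=96.28, recall=92.53)
without_pca = consistent_counts(308, 292, accuracy=70.83, precision=69.62, recall=76.62)
print(with_pca)
print(without_pca)
```

Under these assumptions each set of percentages is matched by exactly one table: TP = 285, FP = 11, TN = 281, FN = 23 with PCA (566 of 600 correct), and TP = 236, FP = 103, TN = 189, FN = 72 without PCA (425 of 600 correct). The three reported measures are therefore mutually consistent with the stated class sizes.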

Conclusion
This research addressed the overlapping data problems in the PPDB dataset of public high schools in Jombang, which could be overcome by combining the PCA algorithm with the Logistic Regression (SVM) algorithm using the ANOVA kernel. The proposed method outperformed the same approach without PCA, with an accuracy of 94.33%, a precision of 96.28%, and a recall of 92.53%. Further research should establish whether the proposed algorithm can also be applied to multiclass cases, not just binary classes as in this study.