Comparative Analysis of Hybrid Model Performance Using Stacking and Blending Techniques for Student Drop-Out Prediction in MOOC

Although Massive Open Online Courses (MOOCs) are in high demand among lifelong learners and as a supplement to academic material, their implementation faces problems, one of which is a student dropout (DO) rate that reaches 93%. As one solution to this problem, machine learning can be utilized as a risk-management and early-warning system for students at risk of dropping out. Ensemble techniques can improve model performance, but previous research has not identified the most optimal ensemble technique for this case study. As a contribution, this study compares the performance of models built with stacking and blending techniques. The base models use the KNN, Decision Tree, and Naïve Bayes algorithms, while the meta-model uses XGBoost. The experimental results with stacking are 82.53% accuracy, 84.48% precision, 94.12% recall, and 89.04% F1-Score, while blending obtained 83.39% accuracy, 85.31% precision, 94.21% recall, and 89.54% F1-Score. These results are supported by model testing with k-fold cross-validation and confusion-matrix techniques, which show the same pattern. Blending is 0.86% higher in accuracy than stacking, so it can be concluded that blending performs better than stacking in the MOOC student dropout prediction case study.


Introduction
Massive Open Online Courses (MOOCs) were developed to fulfil the needs of lifelong learners or as a supplement to formal education [1]. Although student enthusiasm is quite high, MOOC implementation is not free from problems. One of them is the dropout rate, which reaches 66% to 93% [2], [3]. The causes vary, ranging from a lack of social support, motivation, and perseverance [4], difficulty understanding the material, a lack of interaction with the instructor [5], a lack of understanding of learning goals and intentions [6], and a lack of peer support [7]-[9].
Students who drop out may struggle to obtain adequate employment and income in the future, worsening their economic and social conditions [4], while for MOOC organisers, dropouts can affect reputation, rankings, and income [10]. Therefore, risk management and early-warning systems for students at risk of dropping out are needed, for example by utilising ML technology such as classification algorithms. As a prediction system, ML can notify learners and instructors on a regular basis. For learners, this can serve as motivation to complete the course. For instructors, it can be a basis for providing motivation and special attention, which can reduce the potential for dropouts by 14% [11]. In addition, course organisers can use the prediction results to simplify the learning path or adjust the material provided [12].
Several classification algorithms are popular and have been used by previous researchers, including Logistic Regression (LR), K-Nearest Neighbor (KNN), and Random Forest (RF) [2]. That research was conducted by Zengxiao Chi, Shuo Zhang, and Lin Shing. The dataset came from the HarvardX MOOC platform in the 2012 to 2013 range, totalling 416,921 rows and 21 features; after pre-processing, 241,992 rows remained. Model testing was done with 5-fold cross-validation. In that study, RF obtained the highest accuracy, 91.72%. Even so, these results can still be improved.
One way to improve the performance of ML models is ensemble learning. An ensemble combines models in a first layer, called weak learners, that function as complex non-linear feature converters, with a second layer, called the meta-learner, that utilises the residuals of the previous models [13]. There are four techniques for building ensemble models: bootstrap aggregating (bagging), boosting, stacking, and blending. An example of a model built with bagging is Random Forest, while an example of boosting is the Extreme Gradient Boosting (XGBoost) algorithm [14].
Stacking and blending are both built from multiple base models and one meta-model. The difference between the two techniques lies in how the dataset is divided. Stacking uses a dataset split in two, for training and testing, whereas in blending the training data is further separated into ensemble and blender sets [13], [14]. The ensemble data is used to train the base models, which are then tested on the blender data. The predictions are combined with the blender data as new attributes on which the meta-model is trained, and the meta-model is then tested on the testing data. As a result, blending does not reuse overlapping data, unlike stacking, which stacks the prediction results together for both training and testing.
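The difference in data division described above can be sketched as follows. This is an illustrative example on synthetic data with hypothetical variable names, not the authors' code; the 60:40 ratio for the second split is the one the paper uses for blending.

```python
# Sketch of the dataset splits used by stacking vs. blending.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=22, random_state=0)

# Stacking: a single split into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Blending: the training data is split again into ensemble and blender
# sets (the paper uses a 60:40 ratio for this second split).
X_ens, X_blend, y_ens, y_blend = train_test_split(
    X_train, y_train, test_size=0.4, random_state=0)

print(len(X_ens), len(X_blend), len(X_test))  # 480 320 200
```

The base models never see the blender rows during training, which is why blending avoids the overlap that stacking has.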
In the case study of dropout prediction in MOOCs, ensemble models built with stacking or blending techniques can improve the performance of the prediction model. One ensemble model, proposed by Kumar et al., is the Ensemble Deep Learning Network (EDLN) [15]. The dataset used is KDD Cup 2015, which contains student activity logs from the XuetangX MOOC in China; only the first five weeks of data were selected. The study obtained an accuracy of 97.4%, although this value has not been confirmed on complex data covering a full course period.

Research conducted by Shou et al. built a Multiscale Full Convolutional Network with Variational Information Bottleneck (MFCN-VIB) [16]. The model can overcome noise in student-behaviour time-series data that may cause interference. The dataset is the same as in the previous research, namely KDD Cup 2015. The results are a precision of 0.887, recall of 0.960, F1-Score of 0.922, and AUC of 0.872. One weakness of this research is that the model is quite complex, so the execution time is longer, namely 133 seconds. In addition, the accuracy value is not reported.
Another ensemble model, proposed by Fu et al. and called CLSA, combines a Convolutional Neural Network (CNN) with Bidirectional Long Short-Term Memory (Bi-LSTM) [17]. The dataset is again KDD Cup 2015, pre-processed so that 60 thousand activity-log records from 12 thousand students were randomly selected, with 7 features related to behavioural characteristics in the first to fifth weeks. With CLSA, accuracy increased by 2.8% over the base model, to 87.6%.
Although the three studies above conclude that ensembles can improve model performance, they do not identify the most optimal ensemble technique between stacking and blending. Therefore, this study conducts a comparative analysis of the performance of ensemble models built with stacking and blending techniques to determine the most optimal technique for improving model performance. To get an accurate comparison, the same dataset and algorithms are used for both.
To keep the research focused, there are several limitations: the data used is single (tabular) data rather than time series, and the hybrid models are built for academic research purposes, not for deployment in MOOCs or for optimal learning-path customisation. The research starts with a literature study; model building is done in Google Colab using the Python programming language, supported by the scikit-learn library; test results are then obtained and analysed descriptively to determine the most optimal ensemble technique.

Research Methods
This research compares the prediction performance of hybrid models built with stacking and blending techniques. To get an equal comparison, the base-learner algorithms used are KNN, Decision Tree (DT), and Naïve Bayes, while the meta-learner is Extreme Gradient Boosting, commonly abbreviated as XGBoost. The flow and the reasons for selecting these four algorithms are explained in the following paragraphs.
The dataset used in this study is the same as in the three previous studies, namely KDD Cup 2015. The research flow begins with a literature study aimed at finding research gaps or practical problems related to the case used as the object of research, followed by collecting the dataset from the site described previously. The pre-processing stages are (1) ensuring the data is numerical, (2) replacing empty (null) values with 0, (3) feature selection, and (4) data scaling, which aims to maximise the potential accuracy gain [18]. The scaling used is the Standard Scaler, which makes the mean zero and the variance one [19]. Feature selection was done manually by selecting only the features related to user activity logs, yielding 22 features. This manual method could be further optimised with feature-weighting techniques or genetic algorithms so that only strong features are selected, which can be a topic for further research. The correlation between features is visualised with a heatmap, as in Figure 3.
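The null-replacement and scaling steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration on a tiny hypothetical frame, not the authors' pipeline; the column names are invented.

```python
# Sketch of pre-processing steps (2) and (4): null -> 0, then
# StandardScaler to get zero mean and unit variance per feature.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"clicks": [3.0, None, 7.0, 1.0],
                   "video_views": [2.0, 5.0, None, 4.0]})

df = df.fillna(0)            # (2) replace empty (null) values with 0
scaler = StandardScaler()    # (4) zero mean, unit variance
X = scaler.fit_transform(df)

print(np.allclose(X.mean(axis=0), 0))  # True
```

Fitting the scaler on the training data only (and reusing it on the test data) keeps the splits comparable without leaking test statistics.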
Previous studies that applied DT are Park and Yoo [24], and Moreno-Marcos et al. [8].
A decision tree has three components: roots, branches, and leaves. The feature used as the root node is determined through the gain formula in Formula 3. To find the gain value, the entropy must first be computed using the formula in Formula 4. After all attributes become branches, the leaves, whose values are the classification labels, can be determined.
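The entropy and gain computations referred to above (Formulas 3 and 4) can be illustrated with a small worked sketch. This is an assumed standard information-gain calculation, not the authors' implementation.

```python
# Entropy of a label set and the information gain of a candidate split,
# as used to choose the root node of a decision tree.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

y = [1, 1, 1, 0, 0, 0]
# A hypothetical feature that separates the classes perfectly:
print(info_gain(y, [[1, 1, 1], [0, 0, 0]]))  # 1.0
```

A perfect split removes all uncertainty, so the gain equals the parent entropy (here 1 bit); weaker splits score lower, which is why the highest-gain feature becomes the root.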
Similar to the other two base-model algorithms, Naïve Bayes functions as a non-linear feature converter. Its advantages include simplicity, fast training and execution times, and good performance [17], [18]. The algorithm is based on the equation proposed by Thomas Bayes, known as Bayes' Theorem, written in Formula 5, and determines the probability of the target class.
The notation in Bayes' Theorem involves two variables: X, the sample data of unknown class, and C, the hypothesis that X belongs to that class. P(X|C) is the probability of X under the hypothesis, P(X) is the probability of the observed sample data, and P(C) is the prior probability of hypothesis C. The class with the largest probability is chosen as the prediction result. Previous research using Naïve Bayes includes Zheng et al. [25].
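The "largest probability wins" rule can be shown with a tiny numeric illustration. The likelihoods and priors below are invented numbers, purely to show the comparison; since P(X) is the same for both classes, it can be dropped from the comparison.

```python
# Bayes' Theorem comparison: P(C|X) is proportional to P(X|C) * P(C).
# Hypothetical values for a dropout (DO) vs. non-DO decision.
p_x_given_do, p_do = 0.6, 0.7       # likelihood and prior for class DO
p_x_given_not, p_not = 0.3, 0.3     # likelihood and prior for non-DO

score_do = p_x_given_do * p_do      # proportional to P(DO | X)
score_not = p_x_given_not * p_not   # proportional to P(non-DO | X)

print("DO" if score_do > score_not else "non-DO")  # DO
```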
The second layer uses the XGBoost algorithm. In stacking, the new data frame is used as training data and testing is done with the testing data. In blending, the blender data augmented with the new features is used as training data, and the testing data augmented with the same features is used as test data. The test results are then presented in tables for easy reading and understanding.
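The stacking second layer described above can be sketched as follows: the base-model predictions are stacked column-wise into a new frame that trains the meta-model. This is a simplified illustration on synthetic data, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so the sketch runs without extra dependencies; the paper itself uses XGBoost.

```python
# Sketch of the stacking second layer: base-model predictions form a
# new data frame; the meta-model trains on it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

bases = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=0), GaussianNB()]

# New frame: one column of predictions per base model.
Z_tr = np.column_stack([m.fit(X_tr, y_tr).predict(X_tr) for m in bases])
Z_te = np.column_stack([m.predict(X_te) for m in bases])

meta = GradientBoostingClassifier(random_state=0).fit(Z_tr, y_tr)
print(Z_tr.shape)  # (480, 3)
```

In practice a real XGBClassifier drops in with the same fit/predict interface; cross-validated base-model predictions are also commonly used for Z_tr to reduce overfitting.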
XGBoost is selected as the meta-learner to utilise the residuals of the previous models, in the form of the base-model prediction results. XGBoost applies the concept of the Gradient Boosting Decision Tree (GBDT) [26] while improving performance by iteratively adjusting the learned features to reduce the loss function [27]. XGBoost was also used by Wunnasri et al. [28], there as the classification algorithm in the first phase of the model.
The advantage of XGBoost is that its computation is up to 10 times faster and its accuracy higher than Random Forest [29]. Prediction in XGBoost utilises decision trees. Formula 6 is a differentiable loss function that measures how well the model fits the training data, and Formula 7 determines the complexity of the model [30].
As the complexity of the model increases, the corresponding score will decrease in value.
To validate the predictions of the models, k-fold cross-validation and confusion-matrix techniques are used. For validation testing, all previously split data is combined into one and then re-split according to the k-fold iterations. Accuracy, precision, recall, and F1-Score are calculated for each fold. These results are compared and analysed to produce a conclusion.
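The per-fold metric computation can be sketched as follows, with Naïve Bayes on synthetic data as a stand-in for the full hybrid model and k = 5 as in the experiments.

```python
# Sketch of the k-fold validation step: per-fold accuracy, precision,
# recall, and F1-Score from the confusion-matrix-based metrics.
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=22, random_state=0)

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pred = GaussianNB().fit(X[tr], y[tr]).predict(X[te])
    scores.append((accuracy_score(y[te], pred), precision_score(y[te], pred),
                   recall_score(y[te], pred), f1_score(y[te], pred)))

print(len(scores))  # 5, one metric tuple per fold
```

Averaging each metric over the five folds gives the aggregate figures reported in the results tables.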
The research flow is designed to produce comparable results by applying the same treatment throughout: the dataset used, the pre-processing, the algorithms, and the test-validation technique. The difference is in the dataset separation: for training the second-layer model, stacking uses the stacked test results from the first layer, while blending uses the first-layer test results as additional features. The research flow is presented visually in Figure 7.

Results and Discussions
First, a hybrid model was built with the stacking technique. Training used 180,713 rows with the KNN, Decision Tree, and Naïve Bayes algorithms, and testing used 44,929 rows. The chosen k-fold value is five, meaning the data is divided into five subsets, one used for testing and the rest for training.
For each iteration, the confusion matrix is calculated. In addition, the execution time is measured to determine prediction speed. The base-model results are shown in Table 1.
KNN offers flexibility in choosing the k value, the number of nearest neighbours. The greater the value of k, the more neighbours are considered, which can make predictions more accurate, especially for binary classification, because the label is determined by the majority of neighbouring labels. However, KNN has the disadvantage of a fairly long execution time, averaging 213.00 seconds, so the KNN model is suitable for predictions that require high accuracy and where execution time is not a concern.
Naïve Bayes achieves the fastest average execution time, less than one second (0.29 s), although its accuracy is not as good as KNN's. The DT algorithm gets the lowest accuracy because it cannot handle the complexity of the attributes used: the more branches are built, the more complex the decision becomes. DT is therefore more suitable for data with fewer attributes, and according to the research of Ang Ji and David Levinson, bootstrap aggregating (bagging) techniques can overcome this problem [31]. One implementation of bagging is Random Forest, introduced by Leo Breiman in 2001.

Utilising the residuals of the previous models, XGBoost obtains an average k-fold accuracy of 82.89%. Since the accumulated average of the base models is 80.34%, the hybrid model built with the stacking technique improves performance by 2.55%. However, XGBoost's execution time is quite long, an average of 216.01 seconds. This is due to the complexity of the XGBoost algorithm, and finding the most optimal and fastest meta-model algorithm can be reviewed in further research.
Second, a model was built with the blending technique. Slightly different from stacking, blending does not use the base-model test results as training data for the meta-model; instead, the test results are added to the blender data and testing data as new attributes, so the initial 22 features become 25, and the meta-model is trained on these. The large amount of data and features used can affect performance, but blending can mitigate this by separating the training data in each layer so that it does not accumulate as in stacking. The results are shown in Table 3.
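The feature-augmentation step described above can be sketched as follows: base models trained on the ensemble split, with their predictions appended to the blender and test data as three new columns, growing 22 features to 25. Synthetic data and a simplified split stand in for the paper's dataset.

```python
# Sketch of the blending feature step: base-model predictions become
# new columns on the blender and testing data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_ens, X_bl, y_ens, y_bl = train_test_split(X_tr, y_tr, test_size=0.4, random_state=0)

bases = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=0), GaussianNB()]
for m in bases:
    m.fit(X_ens, y_ens)   # base models see only the ensemble split

# Append one prediction column per base model (knn, dt, nb).
X_bl_aug = np.column_stack([X_bl] + [m.predict(X_bl) for m in bases])
X_te_aug = np.column_stack([X_te] + [m.predict(X_te) for m in bases])

print(X_bl.shape[1], X_bl_aug.shape[1])  # 22 25
```

The meta-model then trains on X_bl_aug and is evaluated on X_te_aug, so no row is used for both base-model training and meta-model training.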
The data pattern in these results matches the stacking base-model test. KNN gets the highest accuracy; compared to stacking, KNN with blending gets 0.20% higher accuracy and faster execution. The Decision Tree again gets the lowest accuracy of the three algorithms, with an accumulated accuracy increase of 0.24%; the same applies to Naïve Bayes, with an increase of 0.12%. Here, Naïve Bayes is again the algorithm with the fastest execution time, taking only 0.27 seconds per prediction. As explained earlier, the results are not gathered into one data frame as in the stacking technique; instead they are appended to the blender and testing data as new features named 'knn_predictions', 'dt_predictions', and 'nb_predictions'. The original 22 features thus become 25. The assumption behind adding these features is an expected performance improvement when tested with k-fold cross-validation and the confusion matrix.

Compared with the EDLN model of Kumar et al. [15], whose dataset covered only the first five weeks of the course and whose amount of data was not reported, the reported accuracy of 97.4% is influenced by the amount of data used and cannot be confirmed on complex data over a specific period. In comparison, the blending model in this research achieved an accuracy of 89.35% and was built with all the data in KDD Cup 2015, a total of 225k rows, making it more complex.
Despite the good results, this research still has weaknesses: the blending hybrid model has the longest execution time, and there is no testing based on the Area Under the Curve (AUC), which can visualise all possible classification thresholds [32]. Future research could experiment with finding a combination of algorithms that reduces the execution time of the blending hybrid, or apply metaheuristic optimisation techniques.

Conclusions
The results prove that ensemble, or hybrid, models can improve accuracy, in line with the three studies reviewed in the previous section, although those studies differ in the amount and shape of the data as well as the complexity or combination of algorithms chosen. The hybrid model built with stacking achieves an accuracy of 82.53%, while blending achieves 83.39%. Blending is thus 0.86% higher in this case study of student dropout in MOOCs, with binary classification on single (non-time-series) data.
In addition, the extra processing in the hybrid model makes its execution time longer. This is a gap for further research: improving the model so that execution is faster.
The features used can also be reviewed and re-selected to ensure strong correlation between features, for example using genetic algorithms, or metaheuristic optimisation techniques such as Particle Swarm Optimization (PSO), Ant Colony Optimization, or the Komodo Mlipir Algorithm (KMA).
The results of this research are expected to provide inspiration and a reference for similar research on dropout prediction in MOOCs using ensemble algorithms. The results can also be applied in the real world as an early-warning system that sends regular notifications, reducing the potential for students to drop out. Teachers can use this information to provide intensive guidance, and the MOOC system can determine a dynamic learning path so that students can still complete the course.

Figure 1. Training Data Class Distribution

Figure 2. Testing Data Class Distribution

Figure 3. Feature Correlation Heatmap

After pre-processing, data splitting is done. The stacking technique requires no further splitting because it uses only training data and test data. Meanwhile, the blending technique requires splitting the training data with a 60:40 ratio, producing ensemble data totalling 108,427 rows and blender data totalling 72,286 rows. The class distribution in the ensemble data is 82,515 DO and 25,912 non-DO; in the blender data it is 54,722 DO and 17,564 non-DO. The class distributions on the training and test data are presented visually in Figure 4 and Figure 5. The prepared data is then trained and tested with the three first-layer algorithms: KNN, Decision Tree, and Naïve Bayes. The stacking technique trains and tests using the training and test data, while the blending technique trains with the ensemble data and tests with the blender data.

Figure 6. Error Rate per k Value

KNN aims to find the closest distance or the highest similarity value. The stages in building the KNN model are (1) determining the value of k, (2) calculating the Euclidean distance with the formula in Formula 1, and (3) determining the closest distance with the minimum value.

Figure 7. Research Flow

In stacking, the test results are collected into one new data frame; in blending, the test results are used as new features in the blender data and testing data.

Table 1. Stacking Technique Base Model Testing Results

The hybrid stacking model obtained an F1-Score of 89.04%. The results of testing the stacking model with k-fold cross-validation and confusion-matrix techniques are shown in Table 2.

Table 2. Stacking Technique Hybrid Model Testing Results

Table 3. Blending Technique Base Model Testing Results

Table 4. Blending Technique Hybrid Model Testing Results

Blending has better precision, recall, and F1-Score than the MFCN-VIB model proposed by Shou et al., which is claimed to be able to overcome noise [16]. However, in execution time, MFCN-VIB is faster by a considerable margin of 203.14 seconds.