Tuning Data Mining Models to Predict Secondary School Academic Performance

In recent years, educational data mining has emerged as a growing discipline focused on developing models for predicting academic performance. The primary objective of this research was to tune classification models to predict academic performance in secondary school. The dataset employed for this study encompassed information from 19,545 high school students. We used descriptive statistics to characterise the information contained in personal, school, and socioeconomic variables. We implemented two data mining techniques, namely artificial neural networks (ANN) and support vector machines (SVM). Parameter optimisation was conducted through five-fold cross-validation, and model performance was assessed using accuracy and $F_1$-score. The results indicate a functional dependence between predictor variables and academic performance. The algorithms demonstrated an average performance exceeding 80% accuracy. Notably, ANN outperformed SVM in the dataset analysed. This type of methodology could help educational institutions to predict academic under-achievement and thus generate strategies to improve students' academic performance.


Introduction
Academic performance is conceived as a construct that depends not only on student motivation but also on other factors that may affect it, such as the teacher-student relationship, availability of study tools, and access to computers and internet service, among others. In addition, there are demographic, socio-economic and psychological variables that contribute to the performance of any student, whether in secondary or higher education [1]. García-Tinisaray [2] defines academic performance as the main indicator of student success or failure and notes that it has been considered one of the important aspects when analysing the results of the teaching-learning process. In educational institutions, academic performance is also an indicator of educational efficiency and quality. The academic performance of students is one of the main problems faced by secondary education institutions due to the high failure rate in some subjects, which leads to poor performance during the school year and, in some cases, to student dropout [3]. The results of PISA 2022 show that 25% of 15-year-old students in Organisation for Economic Co-operation and Development (OECD) countries had not achieved a basic level of proficiency in at least one of the three main subjects assessed by PISA: reading, mathematics and science. In absolute numbers, this means that nearly 13 million 15-year-old students in the 64 countries and economies participating in PISA 2022 showed low performance in at least one subject [4].
The PISA tests conducted in 2022 show that developed countries such as Canada, Denmark, Finland, Hong Kong (China), Ireland, Japan, Korea, Latvia, Macao (China) and the United Kingdom boast the best results, with percentages below 10% of students with low academic performance in the three areas evaluated. They also show that countries such as Spain have moderate levels of low academic performance (18.3% and 23.6% for reading and mathematics, respectively). It should be noted that these results are above the average of all the countries evaluated by the OECD [5].
The panorama in Latin America is quite discouraging. Peru is the country with the highest percentage of 15-year-old students who do not reach the basic level established by the OECD (60%, 68.5% and 74.6% in reading, science and mathematics, respectively) [6]. Brazil and Argentina obtained similar percentages for reading, science and mathematics (Brazil: 50.8%, 55% and 68.3%; Argentina: 53.6%, 50.9% and 66.5%) [6]. In Colombia, the problem is very similar to that of other Latin American countries. The percentage of students with low academic performance was 51.4% for reading, 56.2% for science and 73.8% for mathematics, which paints a difficult picture of students' performance during their high school years. Students' poor performance at the school stage will affect their performance in the following learning phases [6]. In Colombia, one way to measure a student's academic performance is through the Saber 11 tests, an evaluation conducted by the Colombian Institute for the Evaluation of Education (ICFES) [7]. The Saber 11 test consists of 268 questions that evaluate five competencies: critical reading, mathematics, social sciences, natural sciences and English. There is currently a challenge in collecting socio-economic, school and personal variables to predict academic performance in the Saber 11 tests. This challenge arises from the absence of numerical variables in the data set: most of the variables are categorical, so the exact values of these characteristics are not known. At present, the dependence or functional relationship between the personal, socio-economic and school variables of Colombian students and their performance in the competencies evaluated in the Saber 11 tests is not known. Studies on academic performance prediction have been conducted in different departments of our country. The models that have been developed and implemented use the results of the Saber 11 tests to predict performance in different university courses [8][9][10][11][12][13][14].

Related Work
Data mining can be used to predict academic performance outcomes in both secondary and higher education students. Studies conducted in the Netherlands in 2017 compared different data mining techniques (ANN, SVM, logistic regression, and Naive Bayes) for predicting academic performance in students in virtual courses. The researchers demonstrated that the predictive performance of an ANN outperformed other classifiers in terms of accuracy. However, the ANN and the other six classifiers did not outperform the findings of other studies, probably because of differences in the predictor variables used and the study setup [15]. Cuevas-Redondo and Estévez Bravo [16] used decision trees and linear regression for the improvement and prediction of academic performance. In the tests performed, linear regression had an error rate of more than 64% compared to the 56% maximum obtained by decision trees. For this reason, in the case of a database with little information, they do not recommend the use of linear regression for the prediction of academic performance. Abu Saa et al. [17] used multiple data mining tasks with the goal of creating qualitative prediction models that were efficient and effective in predicting student grades from a set of collected training data. The researchers implemented data mining tasks on the data set in question (personal, social and academic data) to generate classification models and evaluate them. They implemented four decision tree algorithms as well as the Naive Bayes algorithm. The results showed that student performance does not depend entirely on students' academic efforts, since many other factors have equal influence. The authors conclude that the use of data mining can motivate and help universities find interesting results and patterns that can help both the university and the students in many ways. Kabakchieva [18] implemented classification models, making use of four data mining algorithms (association rules, decision trees, ANN and k-Nearest Neighbour). These algorithms were applied to the available student data after careful preprocessing. The results reveal a classification accuracy between 67.46% and 73.59%. The highest accuracy is achieved for ANN (73.59%), followed by the decision tree model (72.74%) and the k-NN model (70.49%). The most influential factors in the classification process are the data attributes related to the college admission score and the number of failures in the first-year college exams. The study by Romero et al. [19] applied data mining techniques to predict the academic performance of first-year undergraduate students. Among the techniques they used were decision trees, Naive Bayes and ANN. The results showed that the first two had an equal accuracy of 60.52%, while for the ANN, the accuracy was 54.47%. The classification percentages were not very high, which indicates the difficulty of the problem of academic performance prediction, as it is affected by many factors.
In Latin America, the amount of research conducted in this field is increasing. In Brazil, research on the usefulness of machine learning in education has been developed. De Melo [20] reported that one of the ways found to improve the educational system in Latin American countries is to track students. This could be done by collecting and analysing data to find functional dependencies between sociodemographic variables and academic performance. The work aimed to evaluate machine learning techniques on data sets of technical high school students. The results of the research show that the use of machine learning, specifically supervised classification learning, is very useful for the creation of support tools that can monitor and predict student performance. It also showed that decision trees can make predictions with 89% to 94% accuracy. Menacho-Chiok [21] conducted research where several data mining techniques (logistic regression, Naive Bayes and neural networks) were applied using students enrolled in a subject at the Universidad Nacional Agraria La Molina in Peru. The results indicated that the Naive Bayes algorithm obtained the highest classification rate at 71%. Socioeconomic variables have a considerable influence on students' academic performance; therefore, the author recommends using them in order to improve the predictive model.
In Colombia, academic performance has been widely studied in diverse populations. Models have been developed and implemented for the prediction of the academic performance of students in higher or secondary education. Merchan-Rubiano et al. [10] have conducted several investigations applying decision trees to the results of the Saber 11 tests (formerly called ICFES tests) and sociodemographic variables for the prediction of academic performance in engineering courses, with accuracy results of up to 86%. According to the results, the prediction of academic performance in the first year of university reduces the possibility of academic desertion and also improves the quality of the students' formative processes, allowing the orientation of preventive monitoring strategies for people with real possibilities of suffering academic risk [10][11][12].
Moreover, the outcomes of four out of the five competencies of the Saber 11 test, excluding natural science, have been used to predict whether a given student might either fail or drop out of any course during the first semester of the systems engineering bachelor's degree program at the University of Córdoba in Colombia [8]. In the same context, it has been studied whether considering the outcomes of all competencies contributes to predicting whether a student might fail any course related to mathematics or physics during the first term [9].
In [22,23], the goal is to determine the factors that influence the outcomes achieved by students in the Saber test. The former study focuses on Cundinamarca, a region of Colombia, during the period from 2017 to 2021 [22], while the latter examines the period from 2012 to 2022, analysing the performance of students who took the Saber 11 test in Bogotá and comparing their outcomes with those of students in the rest of Colombia [23]. Both studies differ from ours because they adopt descriptive statistical methods in lieu of predictive data mining techniques, whereas we utilise machine learning. Furthermore, in [23], the study aims to determine the effect of the lockdown caused by the COVID-19 pandemic on Colombian students' performance in the Saber 11 test. The study adopts linear regression analysis to estimate the weight of the variables that influence students' outcomes in the test, while we use classification methods in our study to predict students' performance. The drawback of the findings in [23] is that the coefficient of determination is less than 0.5, suggesting that the prediction function has limited predictive power and might not be significantly better than using the mean of the target variable.
Another study conducted in Colombia using data mining for the prediction of academic performance is entitled "Discovering patterns of academic performance in the critical reading competency". The study, carried out by Timarán-Pereira and collaborators, used decision trees to discover patterns of academic performance in the generic competencies of students who took the Saber Pro tests. The study included results of these tests in different regions of the country (Bogotá, Eje Cafetero, Caribbean, Central East, Pacific, Central South and Llano). The results obtained with the decision tree classification model indicate that it is capable of generating models consistent with the observed reality and the theoretical support, based only on the data stored in the ICFES databases. The authors reported difficulties in the development of the research due to the poor quality of the data in the ICFES databases: they had to discard certain attributes because their values could not be obtained from other sources, even though these attributes could, in some way, influence the discovery of the patterns under study. They also reported a high consumption of resources in the data cleaning and transformation process [13].
The vast majority of studies conducted in Colombia on the subject use the results of the Saber 11 tests as predictor variables for the prediction of student performance in undergraduate programs, but few have been conducted for the prediction of test performance using classification tasks. A single report [24] used a regression task for the prediction of the subjects critical reading and mathematics, but the author's publication does not report the performance of the model generated. Another effort on the subject comes from researcher Ferney Rodriguez, who is carrying out a project for the prediction of academic performance, but to the best of our knowledge, so far there is no report of the results obtained in the aforementioned project. The main objective of the author is to find which variables have the greatest influence on academic performance in the Saber 11 tests. Finally, as far as we know, in our department, there are no studies that use personal, school and socio-demographic information variables from the ICFES database to predict student performance in the Saber 11 tests.

Mathematical Notation
This section shows the mathematical notation used in this document. Vectors are represented by lowercase letters ($\mathbf{x}$) and matrices by capital letters ($\mathbf{X}$), both in bold. The superscript $T$ denotes the transpose of a matrix. The superscripts $(i)$ and $[i]$ are used to denote the i-th unit in the i-th hidden layer of an ANN.
Vectors are, by default, represented as columns. The notation $a \in \mathbb{R}$ will be used to indicate that an object is a scalar, and the notation $\mathbf{a} \in \mathbb{R}^{p}$ to indicate a vector of length $p$. To indicate that an object is a matrix, we will use the convention $\mathbf{A} \in \mathbb{R}^{r \times s}$.
We will use $m$ to represent the number of observations or training examples, and $n$ to represent the number of variables available in the data set. We will also denote by $X_{ij}$ the value of the j-th variable for the i-th observation, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Therefore, $i$ will be used to index the training examples (from 1 through $m$) and $j$ to index the variables used for training the model (from 1 through $n$).

SVM
SVMs are one of the most powerful and widely used learning algorithms. This technique has its roots in statistical learning theory and has shown satisfactory results in tasks ranging from image recognition to text categorisation. It is a method that works well with high-dimensional data and avoids the curse of dimensionality. Alpaydin [25] defines SVMs as methods that allow the model to be written as the sum of the influences of a subset of training instances.
The optimisation objective in SVM is margin maximisation. The margin is defined as the distance between the separation hyperplane (decision boundary) and the training instances that are closest to the hyperplane, which are called support vectors (see Figure 1). The SVM method can be classified into two types: linear SVM and nonlinear SVM. In turn, SVM can be applied to separable and non-separable cases.

Linear SVM for Separable Cases
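The linear SVM, also known as the maximum margin classifier, finds a hyperplane with the largest possible margin [27]. If we consider a binary classification problem of $m$ training examples, each example can be denoted as a tuple $(\mathbf{x}_i, y_i)$, for $i = 1, 2, \ldots, m$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$. By denoting the two classes ($C_1$ and $C_2$) using $-1/+1$, we have that
$$y_i = \begin{cases} +1 & \text{if } \mathbf{x}_i \in C_1 \\ -1 & \text{if } \mathbf{x}_i \in C_2 \end{cases} \tag{1}$$
The decision boundary for this case can be described by the following equation:
$$\mathbf{w}^T \mathbf{x} + w_0 = 0 \tag{2}$$
where $\mathbf{w}$ and $w_0$ are the model parameters. The goal of this classifier is to find $\mathbf{w}$ and $w_0$ such that:
$$\mathbf{w}^T \mathbf{x}_i + w_0 \geq +1 \ \text{ for } y_i = +1, \qquad \mathbf{w}^T \mathbf{x}_i + w_0 \leq -1 \ \text{ for } y_i = -1 \tag{3}$$
which can be rewritten as follows:
$$y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \geq +1 \tag{4}$$
Furthermore, a separating hyperplane has the property that:
$$y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) > 0, \quad i = 1, \ldots, m \tag{5}$$
Thus, it follows that if a separating hyperplane exists, it can be used to construct a classifier where an observation with an unknown class is assigned to a class depending on which side of the hyperplane it is located [28]. If we consider the task of constructing a maximum margin hyperplane based on a set of $m$ training observations $\mathbf{x}_1, \ldots, \mathbf{x}_m \in \mathbb{R}^n$ associated with classes $C_1$ and $C_2$, the maximum margin hyperplane is the solution to the following optimisation problem [28]:
$$\max_{w_0, w_1, \ldots, w_n} M \quad \text{subject to} \quad \sum_{j=1}^{n} w_j^2 = 1, \quad y_i \left( w_0 + w_1 x_{i1} + \cdots + w_n x_{in} \right) \geq M, \quad i = 1, \ldots, m \tag{6}$$
The constraint in Equation (6) ensures that each observation is on the correct side of the hyperplane and at a distance of at least $M$ from the hyperplane. Thus, $M$ represents the margin of the hyperplane, and the optimisation problem chooses $w_0, w_1, \ldots, w_n$ to maximise $M$ [28]. Maximising such a margin is equivalent to minimising the following objective function:
$$f(\mathbf{w}) = \frac{\lVert \mathbf{w} \rVert^2}{2} \tag{7}$$
Thus, the learning task for this case can be formalised as the following optimisation problem:
$$\min_{\mathbf{w}, w_0} \frac{\lVert \mathbf{w} \rVert^2}{2} \quad \text{subject to} \quad y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \geq 1, \quad i = 1, \ldots, m \tag{8}$$
Finally, considering that the objective function is quadratic and the constraints on the parameters $\mathbf{w}$ and $w_0$ are linear, the Lagrange multiplier method is used to solve the optimisation problem [25]. The new objective function, known as the primal formulation, is shown in the following equation:
$$L_P = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{m} \lambda_i \left[ y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) - 1 \right] \tag{9}$$
The optimisation problem remains complicated because of the number of parameters involved ($\mathbf{w}$, $w_0$ and $\lambda_i$), so it can be simplified by transforming the function so that it depends only on the Lagrange multipliers. This is known as the dual formulation:
$$L_D = \sum_{i=1}^{m} \lambda_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \quad \text{subject to} \quad \lambda_i \geq 0, \ \sum_{i=1}^{m} \lambda_i y_i = 0 \tag{10}$$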
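As a minimal sketch of the separable case, the following Python snippet checks the margin constraint $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \geq 1$ and computes the geometric margin $1/\lVert \mathbf{w} \rVert$ for a hand-picked hyperplane. The toy data and the values of `w` and `w0` are hypothetical and chosen only for illustration; they are not fitted by a solver and do not come from the study.

```python
import math

# Toy, linearly separable 2-D data with labels in {-1, +1} (hypothetical values).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]

# Candidate hyperplane w^T x + w0 = 0, hand-picked for illustration (not fitted).
w = (0.5, 0.5)
w0 = -1.0


def decision(x):
    """Signed score w^T x + w0 for one observation."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


# Functional margins y_i * (w^T x_i + w0); the separable case requires each >= 1.
functional_margins = [yi * decision(xi) for xi, yi in zip(X, y)]
feasible = all(m >= 1.0 for m in functional_margins)

# Support vectors are the observations that meet the constraint with equality.
support_vectors = [xi for xi, m in zip(X, functional_margins) if abs(m - 1.0) < 1e-9]

# Distance from the hyperplane to the closest observation, and the objective value.
norm_w = math.sqrt(sum(wj * wj for wj in w))
geometric_margin = 1.0 / norm_w
objective = 0.5 * norm_w ** 2

print(feasible, support_vectors, geometric_margin, objective)
```

A smaller $\lVert \mathbf{w} \rVert$ that still satisfies every constraint gives a wider margin, which is why minimising $\lVert \mathbf{w} \rVert^2 / 2$ and maximising the margin describe the same problem.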

Soft-Margin Hyperplane
A maximum margin classifier can be used for classification as long as a separating hyperplane exists. However, in many cases, such a hyperplane does not exist, so a maximum margin classifier cannot be constructed. In this case, the optimisation problem proposed in Equation (6) has no solution with $M > 0$ [28]. In the above case of linear SVM for separable cases, we assume that error-free decision boundaries are constructed. The formulation of the method can be modified so that it learns a decision boundary that tolerates small errors in the training observations (Figure 2). To achieve this goal, a method known as the soft margin approach is used [29]. In this way, it is possible to construct a linear decision boundary when classes are not linearly separable. The objective function (Equation (7)) is applicable in this case; however, the decision boundary no longer satisfies its constraints, so these are modified using a positive slack variable ($\xi_i$) [30], as can be seen below:
$$y_i \left( \mathbf{w}^T \mathbf{x}_i + w_0 \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m$$
The objective function in Equation (7) is modified to penalise the classifier when the training examples are located on the wrong side of the decision boundary. This function is given by the following equation:
$$f(\mathbf{w}) = \frac{\lVert \mathbf{w} \rVert^2}{2} + C \left( \sum_{i=1}^{m} \xi_i \right)^{k}$$
where $C$ and $k$ are user-specified parameters to penalise the classification error of the training instances. The dual formulation in this case is the same as for the separable case (see Equation (10)); however, the multipliers used for the non-separable case are different.
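The role of the slack variables can be illustrated with a short Python sketch. It assumes the slack of each observation is taken as the smallest value that satisfies its relaxed constraint, $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T \mathbf{x}_i + w_0))$, and evaluates the penalised objective for a fixed hyperplane. The data, `w`, `w0`, `C` and `k` are hypothetical values chosen for illustration.

```python
def slack(w, w0, x, label):
    """Smallest slack that satisfies y * (w^T x + w0) >= 1 - slack."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return max(0.0, 1.0 - label * score)


def soft_margin_objective(w, w0, X, y, C=1.0, k=1):
    """Return ||w||^2 / 2 + C * (sum of slacks)^k together with the slacks."""
    slacks = [slack(w, w0, xi, yi) for xi, yi in zip(X, y)]
    penalty = C * sum(slacks) ** k
    return 0.5 * sum(wj * wj for wj in w) + penalty, slacks


# Four well-separated points plus two that violate the margin (hypothetical data).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (1.5, 1.5), (1.5, 2.0)]
y = [+1, +1, -1, -1, -1, +1]

objective, slacks = soft_margin_objective((0.5, 0.5), -1.0, X, y, C=1.0, k=1)
print(slacks, objective)
```

Observations with zero slack respect the margin, a slack between 0 and 1 marks a point inside the margin but correctly classified, and a slack above 1 marks a misclassified point. A larger `C` makes such violations more expensive relative to the margin term.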

SVM on Nonlinear Classification Problems
One of the reasons why SVM has gained so much popularity in the field of data mining is its ability to be kernelised for the solution of nonlinear classification problems. To accomplish this task, it is necessary to map the problem to a new space by conducting nonlinear transformations using basis functions and then using a linear model in this new space [25,31]. A basis function $\phi(\cdot)$ allows the training data to be transformed into a higher-dimensional feature space; however, a drawback of this approach is that creating new features is computationally expensive. To solve this, the kernel trick is used with the following function:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$
One of the most commonly used kernels is the radial basis function, also called the Gaussian kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 \right)$$
where $\gamma$ is a parameter to be optimised. The term kernel can be interpreted as a similarity function between two training samples. Another function used in SVM to deal with nonlinear problems is the sigmoidal kernel function:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left( 2\, \mathbf{x}_i^T \mathbf{x}_j + 1 \right)$$
where $\tanh(\cdot)$ has the same form as the sigmoidal function but with the difference that its range is between $-1$ and $+1$ [25].
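To make the similarity interpretation concrete, the sketch below implements a Gaussian kernel and a sigmoidal kernel in plain Python and builds the matrix of pairwise kernel values. The function names, the sample points, and the scale and offset defaults of the sigmoidal kernel (`kappa` and `theta`) are assumptions made for illustration, not values reported in the study.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, kappa=2.0, theta=1.0):
    """Sigmoidal kernel tanh(kappa * x^T z + theta); kappa and theta are assumed defaults."""
    dot = sum(a * b for a, b in zip(x, z))
    return math.tanh(kappa * dot + theta)


def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical observations
G = gram_matrix(points, lambda p, q: rbf_kernel(p, q, gamma=0.5))
for row in G:
    print([round(v, 4) for v in row])
```

The kernel values are computed directly from the original coordinates, so the higher-dimensional features $\phi(\mathbf{x})$ never have to be built explicitly. With the Gaussian kernel, each point has similarity 1 with itself and the value decays as points move apart.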

SVM for More Than Two Classes
There are currently two versions of SVM that allow the classification of more than two classes in a dataset. First, we have the One-Versus-One classification, which compares each pair of classes by assigning +1 and −1 to each. The test observations are classified using the different classifiers constructed, and the number of times the observation is assigned to each of the classes is counted. The other alternative is known as One-Versus-All, where a class (assigned as +1) is compared with the rest of the observations as another class (assigned as −1) [28].
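The One-Versus-One voting scheme can be sketched as follows. The three class names and the pairwise decision functions are hypothetical stand-ins for trained binary SVMs (here simple thresholds on a single score); a positive output votes for the first class of the pair, a non-positive output for the second.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Classify x by majority vote over all K(K-1)/2 pairwise classifiers."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0], dict(votes)


classes = ["low", "medium", "high"]  # hypothetical performance levels

# Hypothetical pairwise decision functions standing in for trained binary SVMs.
pairwise = {
    ("low", "medium"): lambda score: 40.0 - score,
    ("low", "high"): lambda score: 55.0 - score,
    ("medium", "high"): lambda score: 70.0 - score,
}

for score in (30.0, 60.0, 80.0):
    print(score, one_vs_one_predict(score, classes, pairwise))
```

With $K$ classes this approach trains $K(K-1)/2$ classifiers, each on the observations of two classes only, whereas One-Versus-All trains $K$ classifiers on the full data set.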

ANN
The ANN models relationships between a set of input signals (input layer) and an output signal (output layer). To do so, it uses a model derived from our understanding of how the brain responds to sensory input stimuli. Our brain uses a network of interconnected cells called neurons [32]. The structure of an ANN can be seen in Figure 3 [33]. In this shallow network, with a single hidden layer, the inputs to the artificial neurons in the output layer are not the original variables. Hence, this fully connected network maps the input variables in a p-dimensional space to an M-dimensional space, where there might exist a hyperplane decision boundary to classify the input vectors. Thus, ANNs are known as universal approximators, where the output layers calculate their predictions based on a new representation of the original input variables obtained in the hidden layers.
An important feature of ANNs is their structure, which may contain the input layer, several intermediate layers and the output layer. Such intermediate layers are called hidden layers and their nodes are called hidden units. In addition, the network uses activation functions such as the sigmoid function, the hyperbolic tangent and the Rectified Linear Unit (ReLU) function, among others [29]. The learning process of an ANN consists of two main steps: first, forward propagation, which consists of computing the model predictions and the error; the second step consists of updating the parameters used in the previous step to decrease the error. This last step is known as backpropagation. In forward propagation, the linear function is first applied to the hidden layer of the network:
$$\mathbf{Z}^{[1]} = \mathbf{W}^{[1]} \mathbf{X} + \mathbf{b}^{[1]} \tag{17}$$
Then, applying the activation function $\sigma^{[1]}$ for the units of the hidden layer, we have:
$$\mathbf{A}^{[1]} = \sigma^{[1]}\left( \mathbf{Z}^{[1]} \right) \tag{18}$$
Subsequently, the values for the output layer are calculated. First, the linear function $\mathbf{Z}^{[2]}$ and finally the nonlinear activation function:
$$\mathbf{Z}^{[2]} = \mathbf{W}^{[2]} \mathbf{A}^{[1]} + \mathbf{b}^{[2]} \tag{19}$$
Equation (20) calculates the predictions in the output layer:
$$\hat{\mathbf{y}} = \mathbf{A}^{[2]} = \sigma^{[2]}\left( \mathbf{Z}^{[2]} \right) \tag{20}$$
The last thing to calculate is the model error using the function known as cross-entropy, which for $m$ training examples and $K$ classes can be written as:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)} \tag{21}$$
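The forward pass of a network with a single hidden layer can be sketched in plain Python for one training example. This is an illustration under stated assumptions: ReLU is used in the hidden layer and softmax in the output layer, the error is the categorical cross-entropy for a one-hot label, and the weights `W1`, `W2`, the biases and the input `x` are hypothetical values rather than trained parameters.

```python
import math


def affine(W, a, b):
    """Linear step z = W a + b for one example (W given as a list of rows)."""
    return [sum(wij * aj for wij, aj in zip(row, a)) + bi for row, bi in zip(W, b)]


def relu(z):
    return [max(0.0, v) for v in z]


def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Hidden layer: a1 = relu(W1 x + b1); output layer: y_hat = softmax(W2 a1 + b2)."""
    a1 = relu(affine(W1, x, b1))
    return softmax(affine(W2, a1, b2))


def cross_entropy(y_onehot, y_hat):
    """Categorical cross-entropy for a single example with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_hat) if t > 0)


# Hypothetical parameters: 2 inputs, 3 hidden units, 3 output classes.
x = [1.0, -2.0]
W1 = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

y_hat = forward(x, W1, b1, W2, b2)
loss = cross_entropy([1, 0, 0], y_hat)
print([round(p, 4) for p in y_hat], round(loss, 4))
```

Backpropagation would then differentiate this loss with respect to `W1`, `b1`, `W2` and `b2` and adjust them to reduce it; that step is omitted here.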

Dataset
The data used for this research were provided by ICFES [34]. The information provided corresponds to Córdoba during the second semester of 2017. The data contained 80 predictors related to different characteristics: personal, socioeconomic and school. The variables with personal information about the students are briefly described in Table 1, the socioeconomic variables are described in Table 2 and the school-related variables are briefly described in Table 3. In addition, the data contained information on the five subjects evaluated in the exam (target variables): critical reading, mathematics, social sciences, natural sciences and English. The level of academic performance is classified taking into account the score obtained in each subject (see Table 4). The data set is relatively balanced in terms of these classes, as illustrated in Figure 4. The bar chart in this figure shows that the number of records is nearly equal across all performance classes.

Data Preprocessing
Data preparation was performed using the statistical programming language R [35]. The first step was to split the initial database by subject. An oversampling technique was used to balance the classes when they were unbalanced. Subsequently, categorical variables were coded to numerical values. Missing values were set to 0. Classes or labels (subjects evaluated in the Saber 11 exam) were coded to integer values and constant attributes were eliminated. The rest of the variables presented different scales; therefore, they were standardised to a mean of 0 and a standard deviation of 1. The standardisation procedure can be expressed in the following equation:
$$\mathbf{x}_{si} = \frac{\mathbf{x}_i - \mu}{\sigma}$$
where $\mathbf{x}_{si}$ is the standardised feature vector, $\mu$ is the sample mean of each attribute, and $\sigma$ corresponds to the standard deviation. Each of the preprocessed databases contained information on 19,545 students and 50 columns, including the target variable. Each was divided into two parts, one for cross-validation (50%) and another for testing (50%).
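The standardisation and the 50/50 split can be sketched as follows. The study performed these steps in R; this Python version is only an illustration, and the use of the sample standard deviation (denominator $m - 1$), the random shuffling with a fixed seed, and the example column are assumptions.

```python
import random
import statistics


def standardise(column):
    """Rescale one numeric attribute to mean 0 and standard deviation 1."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)  # sample SD (m - 1 denominator): an assumption
    # Constant attributes (sigma == 0) are expected to have been removed beforehand.
    return [(value - mu) / sigma for value in column]


def split_half(rows, seed=0):
    """Shuffle the rows and split them 50/50 into cross-validation and test parts."""
    rng = random.Random(seed)
    index = list(range(len(rows)))
    rng.shuffle(index)
    half = len(index) // 2
    return [rows[i] for i in index[:half]], [rows[i] for i in index[half:]]


column = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical attribute values
z = standardise(column)
cv_part, test_part = split_half(list(range(10)))
print([round(v, 3) for v in z])
print(cv_part, test_part)
```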

Characterisation of the Evaluated Students
The preprocessed data set was used to characterise the students who took the Saber 11 tests in the second period of 2017. A secondary variable (age) was created from the date of birth. Descriptive statistics were used to obtain the average and standard deviation of numerical variables and the frequency distribution of qualitative variables. The factors taken into account to develop the experiments were classified according to the ICFES into personal information variables, as seen in Table 1. The socioeconomic variables are shown in Table 2. The characteristics related to the educational institutions that students attend are shown in Table 3.

Experiment 2: Tuning ANN
ANNs were created using the RStudio interface of the Keras [37] library, a high-level API that uses TensorFlow for its execution. TensorFlow is a machine learning platform developed by Google [38]. Regarding the network configuration, a single hidden layer was established. For optimisation, different combinations of hyperparameters (units in the hidden layer and λ values for regularisation) were tested. The ReLU activation function was used for the hidden layer. The weights were updated over 10 epochs with batches of 32 training examples. The Adam optimisation method [39] was used with an adaptive learning rate. The network was optimised using categorical cross-entropy as the cost or error function. The softmax activation function was implemented in the output layer. The λ values used were 0.001, 0.01, 0.05, 0.25 and 0.5, and the hidden layer sizes were 30, 50, 80 and 120.
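The search space described above amounts to 5 × 4 = 20 candidate configurations. A minimal sketch of how such a grid could be enumerated is given below; the dictionary keys and the `select_best` helper are hypothetical names, and `cv_score` stands for any function that returns the average cross-validation performance of a configuration.

```python
from itertools import product

lambdas = [0.001, 0.01, 0.05, 0.25, 0.5]  # regularisation strengths from the text
hidden_units = [30, 50, 80, 120]          # hidden layer sizes from the text

# One dictionary per candidate network; key names are illustrative only.
grid = [
    {"hidden_units": h, "reg_lambda": lam, "epochs": 10, "batch_size": 32}
    for h, lam in product(hidden_units, lambdas)
]


def select_best(grid, cv_score):
    """Return the configuration with the highest cross-validation score."""
    return max(grid, key=cv_score)


print(len(grid))
print(grid[0])
```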

Evaluation
We adopted the K-fold cross-validation method as described by Mohri, Rostamizadeh and Talwalkar [40]. This approach was used to find the optimal parameters for both SVM and ANN. With K = 5, the training set is divided into five subsets; in each round, four are used for training and one for validation. The combination of hyperparameters with the best resulting average performance is used for a single evaluation on the test set [40]. There are different metrics to evaluate the performance of data mining models. The metric to use will depend on whether the task to be performed is regression or classification. The metric commonly used to evaluate classification models is accuracy, which is defined as the percentage of correctly classified examples among the total number of classified examples. Higher accuracy means higher model performance. The accuracy can be expressed in the following formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of true positives, TN the true negatives, FN the false negatives and FP the false positives.
In addition, we use the $F_1$-score metric, which combines precision and recall to evaluate classification models. The $F_1$-score can be calculated as follows:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
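The evaluation procedure can be sketched in a few lines of Python. The fold assignment shown here (every fifth observation goes to the same fold) is one simple choice among several and is an assumption, as are the confusion-matrix counts used in the example; the metrics are written for the binary case, using precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def k_fold_indices(m, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [list(range(start, m, k)) for start in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, validation


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    # Assumes at least one true positive, so neither denominator is zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


for train, validation in k_fold_indices(10, k=5):
    print(validation, train)

# Hypothetical confusion-matrix counts for one validation fold.
print(accuracy(tp=40, tn=45, fp=5, fn=10), f1_score(tp=40, fp=5, fn=10))
```

In a full run, each hyperparameter combination would be trained and scored once per fold, and the average of the five validation scores would decide which combination is evaluated on the test set.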

Characterisation of the Dataset
The characterisation of the data was performed using the R language. Averages and standard deviations were calculated for quantitative variables such as age and individual socioeconomic index, while qualitative variables were analysed using frequencies. It is important to highlight that the subjects presented four performance levels, but the initial levels 2 and 3 were combined due to the low amount of data in one of the two intermediate classes.
The average age of the students was 18 years with a standard deviation (SD) of 1.8. The average socioeconomic index among the students was 45.6 with an SD of 9.0. The most common type of document was the identity card with 86.5%, followed by the citizenship card with 10.8% and civil registration with 2.7%. Regarding gender, 54.3% were women and 45.7% were men. The distribution of ethnicity among the students was: Zenú with 9.9%, Afrodescendant with 2.4%, Embera 0.12%, Raizal 0.04%, Wayúu 0.035%, Arhuaco 0.005% and another ethnic group 0.44%. Eighty-seven percent of the students did not belong to any minority ethnic group. The most common municipality of residence was Montería with 26.8%, followed by Lorica with 7.8% (see Figure 5). This is due to Montería being the capital city of the Córdoba department. The frequencies for all the municipalities of Córdoba are shown in Figure 5. The most frequent educational levels for both fathers and mothers were incomplete primary and complete secondary school. The frequency percentages for these two factors can be seen in Figure 6. The most common type of work for fathers was farmer or day labourer, with 23.1%. Among mothers, the most common job was housewife, with 58.6%. The most frequent stratum was stratum 1 with 60.0%, followed by stratum 2 with 20.5%, stratum 3 with 7.1% and no stratum with 4.5% (see Figure 7).
The most common household size was 3-4 people, with 38.1%. Families with 5 or 6 persons in the household corresponded to 37.7%. The lowest percentage, 4.8%, corresponded to households where only 1 or 2 persons lived. Regarding the number of rooms in the home, the highest frequency was two rooms with 43.7%, followed by three rooms with 33.2%, four rooms with 11%, one room with 6.7%, five rooms with 2.9% and more than five rooms with 1.35%. With respect to the technological variables, 30.1% had access to a computer, 73.8% had a washing machine, 28.1% had a microwave oven, 65.5% had a closed TV service and only 8.2% had a video game console. Regarding daily Internet use, 27.6% used it between 30 and 60 min, 24.0% between one and three hours, 22.8% for 30 min or less, 11.3% for more than three hours, and 10% did not use the service.
The possession of transportation vehicles is also one of the factors collected by the ICFES and taken into account for the prediction of academic performance in the Saber 11 tests. The results show that 10.9% have a car at home and 55.2% have a motorcycle. In relation to food, the most frequent consumption level was once or twice per week: 45.3% for milk and derivatives, 35.7% for meat, fish and eggs, and 47.0% for cereals, fruits and legumes.

SVM Models
This section presents the results of applying a support vector machine to the ICFES dataset to predict academic performance in the Saber 11 tests. Developing data mining models is empirical work, in which different parameter values are tested and those that optimise the model are chosen; cross-validation is useful for this purpose. Tables 6-10 show the cross-validation results and the optimal parameter combination for each model. The parameter combination with the highest accuracy was chosen. In addition, the accuracy and F1-score of the models on the test set are reported in Tables 11 and 12.
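The selection procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study: the helper names (`k_fold_indices`, `cross_val_accuracy`, `select_best`), the toy threshold classifier and the toy data are all hypothetical, and the folds are taken in order without the shuffling or stratification that a real experiment would normally apply.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k contiguous (train, validation) folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


def cross_val_accuracy(fit: Callable, X: Sequence, y: Sequence,
                       params: Dict, k: int = 5) -> float:
    """Mean validation accuracy over k folds for one parameter combination."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train], **params)
        hits = sum(predict(X[i]) == y[i] for i in val)
        scores.append(hits / len(val))
    return sum(scores) / len(scores)


def select_best(fit: Callable, X: Sequence, y: Sequence,
                grid: List[Dict], k: int = 5) -> Tuple[float, Dict]:
    """Return (accuracy, params) for the combination with the highest accuracy."""
    scored = [(cross_val_accuracy(fit, X, y, p, k), p) for p in grid]
    return max(scored, key=lambda item: item[0])


def fit_threshold(X_train, y_train, threshold):
    """Hypothetical stand-in model: predicts class 1 when x exceeds a threshold."""
    return lambda x: int(x > threshold)


# Toy one-dimensional data (hypothetical, not taken from the Saber 11 dataset).
X = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
grid = [{"threshold": t} for t in (0.0, 0.5, 1.0)]

best_score, best_params = select_best(fit_threshold, X, y, grid, k=5)
print(best_score, best_params)
```

In a real experiment, `fit_threshold` would be replaced by a function that trains an SVM (or ANN) on the training folds, and the grid would contain the candidate values of C and γ (or of the regularisation parameter and hidden layer size).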
The performance of the SVM models for predicting academic achievement varied with the subject tested and the type of kernel used. The best performance was obtained for critical reading, with an accuracy of 93.5% for the three kernel types. With respect to the F1-score, the best performance was obtained with the sigmoidal kernel, at 93.4%. For mathematics, the accuracy was 82.1% for both the linear and sigmoidal kernels, and the F1-score was 81.7% for the linear kernel. The social science model obtained an accuracy of 72.9% and an F1-score of 72.8% with the linear kernel. The natural science model obtained an accuracy of 81.1% and an F1-score of 81.0% with both the linear and sigmoidal kernels. Finally, for English, the best performance corresponded to the linear and sigmoidal kernels, with an accuracy of 79.2% and an F1-score of 78.7% (see Table 11).

Another parameter tuned to optimise the model was γ. This value was only used and configured for the radial (Gaussian) kernel. The results of the first experiment show that the optimal value of γ, i.e., the one that minimises the model error, was 5000 for critical reading, mathematics and natural sciences, with C = 0.01. Such a high γ value might cause severe overfitting because it yields a highly complex decision boundary in which the influence of each individual training vector is extremely localised, so that the boundary is highly sensitive to the position of each training vector. Nevertheless, the strong regularisation introduced by the low C value counteracts this effect.
On the other hand, for social sciences and English, the optimal value of γ was 0.005, with C = 10. The small γ causes the SVM to create a relatively smooth decision boundary, while the higher C value penalises misclassifications more heavily and thus reduces the training error, because the higher the C value, the weaker the regularisation. This trade-off between hyperparameters aims to achieve a model that generalises properly, without either overfitting or underfitting.
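To illustrate why γ controls how localised each training vector is, the following sketch evaluates the Gaussian kernel for the two optimal γ values reported above. It assumes the common parametrisation K(x, z) = exp(-γ‖x - z‖²); the parametrisation used in the study is not stated here, and the two points are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel, assuming K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical nearby points (squared distance of about 0.02).
x, z = [0.0, 0.0], [0.1, 0.1]

k_large = rbf_kernel(x, z, gamma=5000)   # about exp(-100): practically zero
k_small = rbf_kernel(x, z, gamma=0.005)  # about exp(-0.0001): close to one

print(k_large, k_small)
```

With γ = 5000, even two nearby points are treated as almost unrelated, so each training vector only influences its immediate neighbourhood; with γ = 0.005, the similarity decays slowly and the resulting boundary is smoother.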

ANN Models
The results of applying a single-hidden-layer ANN to the preprocessed datasets are presented below. There is no ideal network architecture for all applications; a suitable architecture is obtained through simulations using cross-validation. Tables 14-18 show the cross-validation results used to choose the optimal parameters (regularisation parameter and hidden layer size) and the best model to be evaluated on the test set. The combination of parameters with the highest accuracy was chosen. The optimised models were then evaluated on the corresponding test sets. Table 19 shows that the best performance was observed for critical reading, where the model achieved 93.6% accuracy and an F1-score of 93.2%. On the other hand, the worst performance was for mathematics, with an accuracy of 82.2% and an F1-score of 82.5%. Regarding the hyperparameters, different values were found. The most common value of the regularisation parameter (lambda) was 0.01. The size of the hidden layer varied between 30 and 80 units, with 30 and 50 units being the most frequent.
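For reference, the two metrics reported for the test sets can be computed as in the sketch below. It assumes a binary problem with class 1 as the positive class; the averaging scheme behind the reported F1-scores (binary, macro or weighted) is not stated here, and the labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, not taken from the Saber 11 dataset.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Accuracy counts all correct predictions equally, whereas the F1-score focuses on the positive class and is therefore more informative when the classes are unbalanced.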

Discussion
The functional dependence between personal, socioeconomic and school variables and academic performance in the Saber 11 tests has not been widely explored in our country. Moreover, training data mining models on categorical data makes it challenging to obtain good results. An added factor is that ANNs tend to work better on unstructured data such as text, images and video. The application of ANNs to tabular or structured data is therefore also considered a challenge for educational data mining research. Nevertheless, this discipline is useful for finding patterns in data coming from educational institutions.
The two data mining techniques applied in the present research obtained an average accuracy above 80%, which is an important step in the development of educational data mining models in our country. The present work focused on applying ANN and SVM to predict the results of the Saber 11 tests. In a related work, Orjuela [24] used SVM for the subjects of language and mathematics, but the author implemented a regression task and did not report the accuracy or performance of the model on a test set.
The performance of the algorithms was aided by the combination of two classes (performance levels) because the classes were unbalanced. The speed of training was aided by standardising the input features. Research worldwide has found data preprocessing useful for improving the performance and convergence of algorithms [15,41,42]. However, preprocessing the dataset does not always improve model performance. This is evidenced in the work by Ahamed [43], who found higher accuracy (78% vs. 70%) on the non-preprocessed dataset using SVM.

SVMs are algorithms that find optimal separating hyperplanes to classify the dataset. The results obtained from applying the support vector machine to the ICFES datasets show that it is not necessary to use a special type of kernel to increase the dimensionality of the input feature space. The three different kernels show no appreciable difference in their results. This indicates that the dimensionality of the input features in the dataset is sufficient to establish a separation between the classes.
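The standardisation of input features mentioned above can be sketched as a z-score transformation. This is a minimal illustration using the population standard deviation; the exact preprocessing used in the study is not detailed here, and the values below are hypothetical.

```python
import math


def standardise(column):
    """Rescale a numeric column to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to rescale
    return [(v - mean) / std for v in column]


# Hypothetical values of a numeric feature such as age.
ages = [16, 17, 18, 19, 20]
z_scores = standardise(ages)
print(z_scores)
```

In practice, the mean and standard deviation would be estimated on the training set only and then reused to transform the validation and test sets, so that no information from those sets leaks into training.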
ANNs have been successfully used for solving classification and regression problems that involve large-scale datasets. Indeed, several empirical studies have shown that the larger the dataset, the more accurate ANNs are, often outperforming other machine learning methods ([44], p. 3).
Concerning the results of the ANN on the subjects evaluated by the ICFES, it is noteworthy that the performance of the algorithm is very similar across all combinations of the regularisation parameter and the number of units in the hidden layer of the network. ANNs may find nonlinear relationships between input features and predictions of academic performance classes regardless of the size of the hidden layer. The outcomes of this study reveal that, in all these settings, the new vector representation of the original input features found in the hidden layer enhances the accuracy of the predictions computed in the output layer. Thus, even when the relationship between the input variables and the target variable is not linear, the output layer can rely on a linear relationship between the outputs of the hidden-layer neurons and the target variable.
In the experiments performed, only one hidden layer was used because the data used in the present investigation are tabular, and a deep network is not necessary for the network to learn the weights that minimise the error or cost function. The results show that 50 hidden units are typically sufficient (in some cases only 30 units were needed) to find a functional dependence between the input variables and the predictions made by the network. The use of very deep networks for tabular data can lead to overfitting, with poor performance on the test set.
The present study shows that ANN performed better than SVM. This can be observed in the subjects natural sciences (93% vs. 81%), social sciences (86% vs. 73%) and English (85% vs. 79%). For the remaining subjects, the performance was similar (see Table 20): academic performance in critical reading and mathematics could be correctly predicted about 93% and 82% of the time, respectively, with both ANN and SVM. The present research does not use statistical tests to compare the results of the algorithms because the metric used is not an average. In cases where a regression task is used and the metric is the mean squared error, tests such as Student's t-test or the Mann-Whitney U test can be used to compare the two data mining techniques. Finally, besides their accuracy, ANNs have another advantage: they provide a probability output, which is useful for decision-makers and stakeholders. For instance, a student with a low probability (e.g., 15%) of achieving a B+ in English proficiency requires more training, and possibly an intervention plan, compared with a student whose probability of attaining this level is higher (e.g., 65%). This probabilistic output gives ANN an advantage over SVM, as the latter does not inherently offer probability information with its predictions. Nonetheless, this limitation of SVM may be mitigated by using Platt scaling [45].
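As a brief illustration of the idea behind Platt scaling, the sketch below maps an SVM decision value f(x) to a probability through a sigmoid, P(y = 1 | f) = 1 / (1 + exp(A·f + B)). In practice A and B are fitted by maximum likelihood on held-out data; the values used here are hypothetical and chosen only to show the shape of the mapping.

```python
import math


def platt_probability(decision_value, a, b):
    """Map an SVM decision value to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical sigmoid parameters; real values come from a calibration fit.
A, B = -2.0, 0.0

p_far_positive = platt_probability(1.5, A, B)    # well inside the positive side
p_on_boundary = platt_probability(0.0, A, B)     # exactly on the hyperplane
p_far_negative = platt_probability(-1.5, A, B)   # well inside the negative side

print(p_far_positive, p_on_boundary, p_far_negative)
```

With a negative A, larger decision values yield probabilities closer to one, so the distance from the separating hyperplane is turned into a graded confidence that can be used in the same way as the ANN output.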

General Considerations
The prediction of academic performance has been investigated for a long time in different parts of the world. The increased use of information technologies has allowed educational data mining to become a reference for the improvement of education in educational institutions. The main objective of this research was to develop and implement two data mining models for the prediction of academic performance in the Saber 11 tests using personal, school and socioeconomic information as predictor variables. SVM and ANN proved to be suitable tools for predicting the academic performance of high school students in the Saber 11 exam: on average across the subjects evaluated, 82% and 88% of students were correctly classified by SVM and ANN, respectively. ANNs perform well in different classification problems, and this research showed that the predictive performance of ANN was superior to that of SVM on the Saber 11 test dataset. With the results obtained in this research, a platform can be developed that allows secondary education institutions to monitor students and provide them with support to improve in the subjects in which they are having difficulties.

Recommendations
This research presents some recommendations for future work. For example, other data mining techniques (decision trees, random forests, Bayesian networks, etc.) or a combination of algorithms could be used to improve performance. Another option may be to use a dimensionality reduction technique such as PCA to find the principal components in order to improve training speed and model performance. In addition, data pertaining to students from all over the country could be collected over several years, since the present research only used data from Córdoba in the year 2017. With a larger amount of data, the functional dependence could be stronger, which could significantly improve the results. Another way to improve the results of this research could be to use different optimisation algorithms as well as other techniques to cope with overfitting (e.g., dropout and early stopping). The present work used Adam (it has been shown to be very good on structured data), but there are other methods, such as gradient descent, RMSprop and Adaline, which were not used due to computational limitations. Finally, it would be interesting to determine which latent factors influence academic performance in the Saber 11 tests in order to generate strategies to help improve the quality of education in Colombia. To this end, we will adopt matrix factorisation algorithms (e.g., singular value decomposition and non-negative matrix factorisation) to compute those hidden factors from the input variables.

Figure 2. Error-tolerant soft margin hyperplane. When an instance is classified, four possible cases may appear: (a) indicates that the instance is on the correct side and away from the margin; (b) indicates that the instance is on the correct side and on the margin; (c) indicates that the instance is on the correct side but inside the margin. Finally, (d) indicates that the instance is on the wrong side: it is a misclassification [25].
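The four cases in this caption can be expressed through the slack variable of the soft margin formulation, ξ = max(0, 1 - y·f(x)), where y ∈ {-1, +1} is the true label and f(x) is the decision value. The sketch below is an illustrative reading of the figure, not code from the study; the function names and example values are hypothetical.

```python
def slack(y, decision_value):
    """Slack variable of the soft margin: max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * decision_value)


def margin_case(y, decision_value):
    """Return the case label (a)-(d) for a labelled instance."""
    functional_margin = y * decision_value
    if functional_margin > 1:
        return "a"  # correct side, away from the margin (slack = 0)
    if functional_margin == 1:
        return "b"  # correct side, exactly on the margin (slack = 0)
    if functional_margin > 0:
        return "c"  # correct side, but inside the margin (0 < slack < 1)
    return "d"      # wrong side or on the hyperplane (slack >= 1)


# Hypothetical (label, decision value) pairs, one per case.
examples = [(1, 2.0), (-1, -1.0), (1, 0.4), (1, -0.5)]
for y, f in examples:
    print(margin_case(y, f), slack(y, f))
```

The parameter C weights the sum of these slack values in the SVM objective, which is why a low C tolerates more margin violations and a high C tolerates fewer.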

Figure 3. Schematic representation of an ANN with a single hidden layer. The blue nodes represent the input variables (input layer), the red nodes represent the hidden layer, and the yellow nodes represent the output variables (output layer) [33]. In this shallow network, with a single hidden layer, the inputs to the artificial neurons in the output layer are not the original variables. Hence, this fully connected network maps the input variables from a p-dimensional space to an M-dimensional space, where there might exist a hyperplane decision boundary to classify the input vectors. Thus, ANNs are known as universal approximators, where the output layers calculate their predictions based on a new representation of the original input variables obtained in the hidden layers.

3.3.1. Forward Propagation
Forward propagation consists of calculating the model output values and the corresponding error between the predictions made and the correct values of the training examples. If we assume an ANN with a single hidden layer, W and b are the parameters to be updated, X are the input features, Z is the linear function, and σ(·) is a nonlinear activation function. The following equations demonstrate the calculations performed in the forward propagation process.
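A minimal sketch of these calculations for one training example is given below, using the notation of the text: Z = W·X + b for the linear step and A = σ(Z) for the activation, applied first to the hidden layer and then to the output layer. The sigmoid activation, the binary cross-entropy error, the layer sizes and the weight values are assumptions made for illustration; they are not taken from the study.

```python
import math


def sigmoid(z):
    """Assumed nonlinear activation sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Linear step Z = W x + b for a single example x."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer and one output layer."""
    z1 = affine(W1, x, b1)           # Z[1] = W[1] X + b[1]
    a1 = [sigmoid(z) for z in z1]    # A[1] = sigma(Z[1]), hidden representation
    z2 = affine(W2, a1, b2)          # Z[2] = W[2] A[1] + b[2]
    a2 = [sigmoid(z) for z in z2]    # A[2] = sigma(Z[2]), predicted probability
    return a1, a2


def binary_cross_entropy(y, y_hat):
    """Assumed error between the true label y and the prediction y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# Hypothetical network: p = 2 inputs, M = 3 hidden units, 1 output unit.
x = [0.5, -1.0]
W1 = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.1]]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.2, -0.1, 0.05]]
b2 = [0.0]

hidden, output = forward(x, W1, b1, W2, b2)
loss = binary_cross_entropy(1, output[0])
print(hidden, output, loss)
```

The hidden activations `hidden` are the new M-dimensional representation of the input on which the output layer bases its prediction; backpropagation would then use `loss` to update W and b.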

Figure 4. Number of records per class across the 5 subjects evaluated in the Saber 11 test.

Table 1. Variables of personal information collected from students in the Saber 11 test.

Table 2. Socioeconomic variables collected from students in the Saber 11 test.

Table 3. School-associated variables collected from students in the Saber 11 test.

Table 4. Classification of the level of performance in the Saber 11 test.

In Table 5, all the frequencies of weekly food consumption are shown.

Table 5. Frequency of weekly food consumption by students.

Table 6. C and γ optimisation for radial kernel in critical reading.

Table 7. C and γ optimisation for radial kernel in mathematics.

Table 8. C and γ optimisation for radial kernel in natural sciences.

Table 12. F1-score of the SVM models using different kernel types.

Table 13. Optimal values of the regularisation parameter C used in SVM.

Table 14. Optimisation of the regularisation parameter and size of the ANN hidden layer for critical reading.

Table 15. Optimisation of the regularisation parameter and size of the ANN hidden layer for mathematics.

Table 16. Optimisation of the regularisation parameter and size of the ANN hidden layer for natural sciences.

Table 17. Optimisation of the regularisation parameter and size of the ANN hidden layer for social sciences.

Table 18. Optimisation of the regularisation parameter and size of the ANN hidden layer for English.

Table 19. Accuracy and optimal hyperparameters of the ANN models.

Table 20. Accuracy comparison between SVM and ANN models.