Assessing the Relevance of Information Sources for Modelling Student Performance in a Higher Mathematics Education Course

Abstract: In recent years, most educational institutions have integrated digital technologies into their teaching–learning processes. Learning Management Systems (LMS) have gained increasing popularity, particularly in higher education, due to their ability to manage teacher–student interactions. These systems store valuable information which describes students' behaviour throughout a course. These data can be utilised to construct statistical models that represent learner behaviour within an online LMS platform. In this study, we aim to compare different sources of information and, more ambitiously, to provide insights into which source of information is most valuable for inferring student performance. The considered sets of information come from (i) the Moodle LMS; (ii) socio-economic data about students acquired from a survey; and (iii) subject marks achieved throughout the course. To determine the relevance of the incorporated information, we use artificial intelligence (AI) methods, and we report the importance measures of four state-of-the-art methods. Our findings indicate that the selected methodology is suitable for making inferences about student performance while also shedding light on model decisions through explainability.


Introduction
Improving the quality of education is a crucial objective for achieving sustainable development, as recognised by the United Nations in its fourth Sustainable Development Goal (https://www.un.org/sustainabledevelopment/education/ (accessed on 15 May 2023)). Access to quality education is essential for enhancing people's lives and promoting sustainable development. A primary objective related to higher education is to ensure access to affordable technical, vocational, and higher education, as well as to expand scholarships for developing countries in these fields.
In particular, mathematics has garnered special attention due to its cross-disciplinary nature and connections with teaching-learning methodologies such as STEM education [1], which emphasises Science, Technology, Engineering, and Mathematics. Mathematics plays a critical role in Computer Science programs in higher education. Specifically, it serves as one of the core foundations for developing theories and methods in computer and information sciences. The formalism and logical language of mathematics are instrumental in fostering computational reasoning and thinking.
In recent years, Learning Management Systems (LMSs) have been widely adopted by universities. As a result of the COVID-19 pandemic, many face-to-face courses transitioned to fully online or blended learning [2] environments, guided by educational, hygienic, and/or political considerations. This shift in educational paradigms has substantially impacted the conventional approaches to teaching, learning, and interaction for both teachers and students.
The cornerstone of educational technology is the Learning Management System (LMS). It is typically an online platform designed to organise, invigorate, mentor, assess, manage, and administer learning activities [3]. Its primary responsibilities include managing users (students, teachers, and administrators), resources, and activities, as well as monitoring the educational process through assessments and reports. Furthermore, it equips members of the educational community with communication tools such as internal messaging, chats, video conferencing, forums, and more. By utilising such a virtual platform, teachers and students can benefit from accessing and sharing a unified information source. In recent educational research, the subject of student performance analysis and prediction has attracted considerable attention [3,5,6]. Traditionally, inferences about student grades have been made using various offline data sources, such as student grades, demographics, social, and school-related characteristics, primarily collected through school reports. This methodology is demonstrated in [7]. The use of LMS has introduced the incorporation of student data obtained through an online platform when examining student behaviour throughout the course, including aspects such as activity, platform engagement, assistance, and assessments.
The current era is predominantly defined by technological developments such as computers and smartphones, fuelled by the vast amount of data collected and generated by both humans and machines [4]. These data-driven technological tools are closely intertwined with advancements in mathematics. Increasing success in higher education, particularly in mathematics, is of crucial importance [8].
Computer tools, such as data logging systems, graphing tools, simulation, and modelling environments, can also influence learning by facilitating changes in classroom interactions [9]. The information captured by these systems has been shown to be useful in describing students' behaviour [6,10]. Furthermore, the study by [4] emphasises the importance of user interactions on LMS platforms for obtaining relevant information about students. Despite the breadth of literature on this topic, we found that most works focus on the stored information within LMSs [11]. This line of study has received special attention [12] with the advent of Artificial Intelligence. Moreover, it has been extensively investigated [13] to gain a better understanding of student behaviour and to design appropriate educational environments that facilitate the teaching-learning process.
In [14], the authors discuss the limitations of certain electronic learning environments and advocate for math-friendly systems that allow for the exchange of mathematics diagrams and notation between instructors and students. The study concludes that learning difficult concepts in these math-friendly environments is comparable to doing so in face-to-face courses. We recognize the importance of using such math-friendly environments and have opted to utilize the Moodle Learning Management System (LMS), which supports the use of LaTeX (https://www.latex-project.org (accessed on 15 May 2023)) for sharing diagrams with students. LaTeX has proven to be the predominant language for document preparation in the scientific community, particularly in the field of mathematics, due to its compatibility with mathematical symbols and syntax. The use of multimedia sources has also been demonstrated to be effective in mathematics education. For example, in [15], the authors employed an interactive, multimedia-based instructional system in a mathematics methods class for pre-service elementary school teachers. Their findings revealed that students were more likely to integrate knowledge acquired from the system into their teaching methods compared to conventional approaches. LMSs have also proven to be valuable tools during the COVID era [16], where teachers confronted the complexities of online instruction and developed innovative forms of collaborative work.
In this study, we investigate three different sources of information to make inferences about student performance through artificial intelligence (AI) models. Firstly, we used purely LMS-generated data collected in the course log-file, adopting the methodology recently proposed in [6]. This information is based solely on activities conducted within the LMS and encompasses data in the Event Name and Event Context columns of the log-file. It quantifies students' interaction in terms of activities, file downloads, forum participation, class attendance, and more. For further details, see Table 1. Secondly, we used data acquired from a survey that gathers socio-economic features of students. We adapted the survey proposed in [7]. The set of features collected in the survey is detailed in Table 2. Finally, we used students' marks generated during the course, which contribute to their final course grades. In-depth information about the marks can be found in Table 3. By analysing these information sources, we aim to better understand and predict student performance in educational settings.

Table 3 (entries recovered from the source): 3. Videos' mark; 4. Seminars' mark; 5. Practicals' marks; 6. Voluntary homework's marks.

The remainder of this work is organised as follows. Section 2 reviews the materials and methods used in the present study. It offers insights into the AI methods used to make inferences about students' performance and their corresponding model interpretation. Additionally, it explains the data acquisition and harmonisation procedure. Section 3 presents an exhaustive description of the experiment and the results achieved, along with their interpretation. Finally, the last section provides the concluding remarks and some suggestions for possible extensions of our work.

Materials and Methods
This section elaborates on the statistical learning methodology employed in the experimental setup, offering a broad comparison across four different methods utilised in the machine learning field. Furthermore, it delves into the details of the data set designed specifically for this study. In particular, we provide a comparison among data generated from the Moodle LMS, a socio-economic survey, and students' marks achieved in the subject. In our study, we propose applying artificial intelligence techniques to exploit the LMS-generated data [13]. This combination of AI and educational data exploitation has attracted considerable interest in the educational field [12].

Statistical Learning Models
In this section, we outline the various methods employed in the experimental Section 3, along with the metrics used to measure their performance.
In particular, we used four state-of-the-art methods for predicting student performance in the teaching subject of computer science within the Mathematics degree at the University of València. We considered these four models due to their strong performance in regression tasks and their ability to provide understandable explanations for their decisions through model weight inspection. These models are the Gaussian Processes regression (Section 2.1.1) (GP), a powerful nonlinear model; the Partial Least Squares (Section 2.1.2) (PLS), a model useful in small sample size problems and robust to multicollinearity [17]; LASSO (Section 2.1.3), a model that yields a sparse representation; and Ridge Regression (Section 2.1.4) (RR).
In the following, we consider a data set $D$ of $n$ observations, where the $d$-dimensional input vector $x_i$, referred to as the feature vector, contains a representative set of $d$ features representing the $i$-th student, and the output scalar $y_i$ represents the associated student's mark in the subject. The goal of our work is to fit a model $f$ between the input feature vector $x_i$ and the corresponding output value $y_i$, i.e., $f(x_i) = y_i$, $i = 1, \ldots, n$. One advantage of this model is its ability to make inferences on a new, unseen feature vector $x_*$ through $f(x_*)$. Recent AI techniques rely on the Empirical Risk Minimisation principle and on Statistical Learning theory, which offers statistical guarantees and good performance in model fitting [18,19]. All the methods used in this study are based on this theory and are revisited below.
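For reference, the Empirical Risk Minimisation principle mentioned above can be written in its generic textbook form. The loss function $\ell$ and the hypothesis class $\mathcal{F}$ are notation introduced here for illustration only; they are not taken from the study.

```latex
\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr),
\qquad \text{e.g. } \ell(\hat{y}, y) = (\hat{y} - y)^2 \text{ for least squares regression.}
```

Each of the four methods below can be read as a particular choice of $\mathcal{F}$ and of a regularisation term added to this objective.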

Gaussian Process Regression
Gaussian Process Regression (GPR) [20] is a probabilistic model that offers a nonlinear least squares regression model through the use of kernel methods [21]. Specifically, we used the Automatic Relevance Determination (ARD) kernel covariance function, which in its standard form reads

$$k(x, x') = \exp\left(-\tfrac{1}{2}\,(x - x')^\top \Sigma^{-1} (x - x')\right), \qquad (1)$$

where $\Sigma$ is a diagonal matrix with diagonal $\{\sigma_1^2, \ldots, \sigma_d^2\}$, whose parameters weigh each input dimension. The ARD kernel is a natural extension of the Radial Basis Function (RBF) kernel: whereas the RBF kernel has only one parameter, the ARD kernel weighs each feature independently (the scale factor is ignored in both covariance functions for the sake of convenience). For a comprehensive understanding of the model formulation and a broader study of Gaussian Processes theory, refer to [20]. One advantage of using the ARD kernel is that it allows for estimates of the importance of each variable. In particular, we define the importance measure of the $i$-th feature from its kernel parameter $\sigma_i^2$; in the feature ranking analysis, these values are reported as $|\log \sigma_i^2|$, normalised to sum to 1. This measure reveals the importance assigned to the $i$-th input feature by the GP model.

We choose this method for several reasons. First, it is nonlinear and provides robust estimations when the relationship between input and output data is nonlinear. Second, since it is based on a probabilistic approach, it offers confidence intervals for inferred values. Third, the ARD kernel allows for the establishment of an importance ranking for input features, shedding light on the model behaviour during the inference process.
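To make the role of the per-feature parameters concrete, the following is a minimal sketch of an ARD covariance function in its standard form. The factor 1/2 in the exponent and the function names are assumptions made for illustration; the authors' exact parameterisation may differ.

```python
import math

def ard_kernel(x, z, sigma2):
    """ARD kernel (assumed standard form): exp(-0.5 * sum_k (x_k - z_k)^2 / sigma2_k)."""
    if not (len(x) == len(z) == len(sigma2)):
        raise ValueError("x, z and sigma2 must have the same length")
    quad = sum((a - b) ** 2 / s for a, b, s in zip(x, z, sigma2))
    return math.exp(-0.5 * quad)

def rbf_kernel(x, z, sigma2):
    """RBF kernel with a single shared parameter: ARD with all sigma2_k equal."""
    return ard_kernel(x, z, [sigma2] * len(x))

# Two hypothetical feature vectors that differ only in the first feature.
x = [1.0, 0.0]
z = [0.0, 0.0]
k_same = ard_kernel(x, x, [1.0, 1.0])    # identical inputs give covariance 1
k_short = ard_kernel(x, z, [0.5, 1.0])   # small sigma2_1: the difference matters
k_long = ard_kernel(x, z, [100.0, 1.0])  # large sigma2_1: the difference is nearly ignored
```

In this sketch, a large value of `sigma2[k]` makes the kernel almost insensitive to feature `k`, which is why the fitted parameters can be read as a relevance indicator.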

Partial Least Squares
Partial Least Squares [22] (PLS) is a statistical method that finds a linear regression model by projecting the predicted variables and the observable variables onto a new space. The PLS model seeks to identify the multidimensional direction in the input space of X (formed by the input vectors) that accounts for the maximum multidimensional variance direction in the output space (formed by the values in y). PLS regression is particularly suitable when the predictor matrix has more variables than observations (d > n) and when there is multicollinearity [17] among the input variables.
In summary, the PLS model is advantageous when there are more dimensions than input samples. Moreover, it is well-suited to deal with multicollinearity, which refers to the presence of highly correlated input features.
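As an illustration of the projection idea, the sketch below fits a PLS regression with a single latent component and a scalar output (often called PLS1). It is a simplified variant written for this example, not the implementation or the number of components used in the study; the function name and the toy data are hypothetical.

```python
def pls1_one_component(X, y):
    """Fit a one-component PLS1 regression and return a predict(x) function."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: input-space direction with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        return lambda x: y_mean  # no feature co-varies with y
    w = [v / norm for v in w]

    # Scores (projection of each sample onto w) and regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(d))
        return y_mean + q * score

    return predict

# Toy case with more features than samples (d = 3 > n = 2) and collinear columns.
X_demo = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
y_demo = [1.0, 2.0]
predict = pls1_one_component(X_demo, y_demo)
y_new = predict([3.0, 6.0, 9.0])
```

Ordinary least squares has no unique solution on this toy data because the columns are exact multiples of each other, whereas the single PLS direction is still well defined.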
All models have been trained and evaluated through the leave-one-out (LOO) procedure [23], a robust statistical technique with theoretical guarantees. This technique selects one sample for model testing and uses the remaining samples for model building. Once the model is trained with the $n - 1$ samples, it is tested on the held-out sample. This process is repeated $n$ times, so that each sample is held out exactly once, and the final averaged error is provided.
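The leave-one-out loop can be sketched as follows. The `fit` argument stands for any training routine that returns a prediction function; `fit_mean` is a hypothetical baseline used only to keep the example self-contained, and the absolute error per held-out sample is an assumed choice of error measure.

```python
def loo_errors(X, y, fit):
    """Leave-one-out: train on n-1 samples, test on the held-out one, n times."""
    errors = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train)
        errors.append(abs(predict(X[i]) - y[i]))  # absolute error on the held-out sample
    return errors

def fit_mean(X_train, y_train):
    """Hypothetical baseline model: always predicts the mean of the training marks."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean

X_demo = [[0.0], [1.0], [2.0]]
y_demo = [1.0, 2.0, 3.0]
errors = loo_errors(X_demo, y_demo, fit_mean)
mean_error = sum(errors) / len(errors)
```

With only 18 students, as in this study, LOO uses 17 samples for training in every fold, which is the main reason it is attractive for small data sets.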

LASSO
Some models are considered black-box models as they do not provide information about their decision-making process. One approach to circumvent this issue is to examine the model weights, which express the importance or relevance of the input variables in the inference process. We propose using L1-constrained linear least squares fits [24] (LASSO). LASSO is a least squares problem formulation with an L1 penalty term applied to the model weights. This method enforces sparsity on the model weights, aiming to set non-relevant features in the input data to zero. Consequently, this allows for a clearer representation of which input features are relevant when the model performs inferences.
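The sparsity effect of the L1 penalty can be seen in a small coordinate-descent sketch. It assumes the objective $(1/(2n))\|y - Xw\|_2^2 + \lambda \|w\|_1$ with no intercept (columns assumed centred); the scaling convention, the function names and the toy data are assumptions for illustration, not the solver used in the study.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the residual that ignores feature j
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w

# Toy data: the output depends strongly on feature 0 and only weakly on feature 1.
X_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_demo = [2.0, -2.0, 0.05, -0.05]
w = lasso_coordinate_descent(X_demo, y_demo, lam=0.1)
```

In this toy example the weight of the weakly related feature is set exactly to zero, while the weight of the relevant feature is shrunk but kept.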

Ridge Regression
The method of least squares regression (LS) is a standard inference technique. It utilises a data matrix $X \in \mathbb{R}^{n \times d}$, containing real-valued observed variables. Here, $n$ represents the total number of samples, and $d$ is the number of variables or covariates in the study. The goal is to make inferences about another variable $y \in \mathbb{R}^{n}$ through linear model weights $w \in \mathbb{R}^{d}$. In this case, the estimated output variable is $\hat{y} = Xw$. One of its limitations arises in the presence of multicollinearity, when some input variables are correlated, which is the case in our study. Several alternatives exist to address this issue, one of the most accepted being Ridge Regression (RR). In the context of multicollinearity, standard linear regression may result in poor estimates. To overcome this, a regularisation term can be added to the original least squares problem, known as Tikhonov regularisation, which leads to ridge regression. The regularisation term can be adjusted to match the amount of noise in the input data.
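The regularised problem and its closed-form solution can be written as follows. The symbol $\lambda$ for the regularisation strength is notation assumed here; the text above does not name it.

```latex
\hat{w}_{\mathrm{RR}} = \arg\min_{w \in \mathbb{R}^{d}} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2
\quad\Longrightarrow\quad
\hat{w}_{\mathrm{RR}} = \left(X^\top X + \lambda I_d\right)^{-1} X^\top y, \qquad \lambda > 0 .
```

For any $\lambda > 0$ the matrix $X^\top X + \lambda I_d$ is invertible, even when $X^\top X$ is singular because of collinear columns or because $d > n$, which is what makes the estimate stable under multicollinearity.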

Data-Set Harmonisation
In this section, we describe the three different sources of information: the Moodle (https://moodle.org/ (accessed on 15 May 2023)) LMS, a survey capturing students' socioeconomic characteristics, and scores obtained in various tasks such as homework and coursework.
Our student sample consists of a full course of computer science within the mathematics degree. In particular, the study was developed with a total of 18 students in a first-semester course ranging from September to December of 2022.

Moodle Log-File Data
We adopted the methodology proposed in [6] to extract an informative set of variables from the Moodle platform throughout the subject. This approach is effective for obtaining activity data generated on the online LMS platform. It involves creating occurrence matrices from the raw log-file of the online course by counting the amount of activity generated by each student in terms of the Event Name and Event Context components. The former, Event Name, pertains to activities such as questionnaires or quizzes proposed by the teacher, while the latter, Event Context, refers to Moodle's internal categorisation of the created Event Name. Table 1 presents the 38 features exhibiting activity during the course.

The LMS stores raw log data sequentially in a log-file. Event observers are unable to alter event data or stop the dispatching of events, since the communication connection is one-way. The variables that Moodle stores serve as the primary source of data for the inference procedures considered in this study. The raw Moodle log-file, which is kept in plain text format, is made up of a series of user-performed events. This log-file only records a student's activities within a course; it excludes other log data, such as internal system errors. As a result, it offers a wealth of data that may be used to identify and describe student activities throughout the teaching-learning process. In particular, the information used has been shown to comprise practical variables to aid in inference [5,6,25].

Table 2 displays the various features used to describe the socio-economic aspects of the student population in this study. We administered a survey comprising 29 features, as outlined in [7]. Specifically, the attributes from 2 to 30 can be found at the URL cited therein (https://archive.ics.uci.edu/ml/datasets/student+performance (accessed on 15 May 2023)).
For the sampled students, certain variables remained constant, such as the lack of extra educational support (schoolsup) received by any of them, the absence of attendance at nursery school (nursery), the presence of internet access at home (internet) for all, and no alcohol consumption on workdays (Dalc). These data were obtained through a survey administered to the students.
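The occurrence matrices built from the Moodle log-file, as described above, could be sketched along the following lines. The CSV layout, the column names ("User full name", "Event context", "Event name") and the sample rows are assumptions made for illustration; they should be checked against the actual export of the course log.

```python
import csv
import io
from collections import Counter, defaultdict

def occurrence_matrix(log_file, key_column, user_column="User full name"):
    """Count, for each student, how often each value of key_column occurs in the log."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(log_file):
        counts[row[user_column]][row[key_column]] += 1
    users = sorted(counts)
    features = sorted({name for c in counts.values() for name in c})
    matrix = [[counts[u][f] for f in features] for u in users]
    return users, features, matrix

# Hypothetical log excerpt with four events; real exports contain more columns.
sample_log = io.StringIO(
    "User full name,Event context,Event name\n"
    "Student A,Quiz: Test 1,Quiz attempt viewed\n"
    "Student A,File: Notes,Course module viewed\n"
    "Student B,Quiz: Test 1,Quiz attempt viewed\n"
    "Student A,Quiz: Test 1,Quiz attempt viewed\n"
)
users, features, matrix = occurrence_matrix(sample_log, "Event name")
```

Calling the same function with `"Event context"` as `key_column` would give the second occurrence matrix; each row of `matrix` is then one student's feature vector.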

Course Marks
To fully investigate the topic at hand, we examine the impact of student performance throughout the course. In particular, we consider the grades presented in Table 3, which include marks for classroom assignments, exams, seminars, practicals, and voluntary homework.

Correlation Analysis
In this section, we analyse the correlation between input variables, as well as the correlation between input variables and the output variable.

Correlation between Input Variables
We computed the correlation between observed variables within each data set source separately. Recall that we have three different sources of information: Moodle LMS, socioeconomic data, and subject marks.
Our data correlation study detected multicollinearity, which refers to the presence of high correlation coefficients $\rho$ between variables. We consider the Pearson correlation coefficient [26] to measure the correlation between two variables $x, y$, as shown in the equation

$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y},$$

where $\operatorname{cov}(x, y)$ is the covariance of the two variables and $\sigma_x$, $\sigma_y$ are their standard deviations. We deliberately chose the methods proposed above due to their robustness in the presence of multicollinearity among input variables. Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of precision. In this case, minor adjustments to the model or the data may result in unpredictable changes in the coefficient values of the multiple regression. However, in our sample data set, multicollinearity only affects computations related to specific variables and does not impact the overall predictive potential or reliability of the model. We also included results in terms of Spearman's rank correlation coefficient [27], $r_s = \rho_{R(x),R(y)}$, where the $n$ raw scores $x_i, y_i$, with $1 \le i \le n$, are converted to ranks $R(x_i), R(y_i)$. This means that $r_s \in [-1, 1]$ takes its highest values (recall that it is bounded by 1) when higher values of $x$ correspond to higher values of $y$.

Figure 2 represents the percentage of pairs of input variables $x_i, x_j$, $1 \le i, j \le d$, that exhibit a correlation $|\rho(x_i, x_j)| > \rho_0$, for a fixed value of $\rho_0 \in [0, 1]$. This provides an indication of the number of correlated variables for a given threshold $\rho_0$. At a threshold of $\rho_0 = 0.2$, around 40% of the variable pairs in the LMS and socio-economic data sets are correlated. However, the proportion of pairs with a correlation value above $\rho_0 = 0.5$ is less than 10% in these data sets. The course marks data set, which is the smallest, shows a slower decay of the correlation values but also exhibits a relatively low proportion of correlated variables.
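The quantities used in this correlation analysis can be sketched in a few lines. The helper names and the toy columns are hypothetical, ties are given average ranks (a common convention, assumed here), and constant columns must be removed beforehand because their standard deviation is zero.

```python
def pearson(x, y):
    """Pearson correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) for constant variables

def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def fraction_correlated(columns, rho0):
    """Share of distinct variable pairs with |Pearson correlation| above rho0."""
    d = len(columns)
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    hits = sum(1 for i, j in pairs if abs(pearson(columns[i], columns[j])) > rho0)
    return hits / len(pairs)

# Hypothetical variables: b grows monotonically but nonlinearly with a.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 4.0, 9.0, 16.0]
c = [2.0, 1.0, 2.0, 1.0]
r_pearson = pearson(a, b)
r_spearman = spearman(a, b)
share_02 = fraction_correlated([a, b, c], 0.2)
share_05 = fraction_correlated([a, b, c], 0.5)
```

The example illustrates the difference between the two coefficients: the monotone but nonlinear relation between `a` and `b` gives a Spearman coefficient of 1 while the Pearson coefficient stays slightly below 1.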

Correlation between Inputs and Output Variable
Before studying the correlation between the input variables and the output variable, we provide some information about the marks (the output variable). Figure 3 illustrates the kernel density estimation of the students' final marks. It is visually evident that most of the students passed the subject. Around 2% had a score of 3 (out of a maximum of 10), while there was a peak around the 8.5 mark value, indicating the good performance of the students in the subject.

In Figure 4a, the correlations of the Moodle LMS variables with $y$ are shown (refer to Table 1). The variable with the highest positive correlation coefficient is 32, corresponding to Status' submission has been viewed. In Figure 4b, the correlations of socio-economic variables with $y$ are shown. The two variables with the highest correlation coefficients are variable 4 (Family size), with a coefficient over 0.6, and variable 16 (Extra paid classes), with a negative coefficient (refer to Table 2). In Figure 4c, the correlations of the students' course marks with $y$ are shown; four out of six variables achieve relatively high correlation coefficients, with values over 0.7 (refer to Table 3).

Feature Ranking Analysis
Machine learning (ML) methods are often regarded as black-box models [28]. Black-box models are systems or processes that can be constructed based solely on their inputs and outputs, without any understanding of how they operate internally. However, the proposed ML models can be interpreted through the inspection of their weights, which provides information about the relevance of the variables in the final trained model [29]. In the following sections, we provide the weights of the models in absolute value and normalised form for the PLS, LASSO, and RR methods. For the GP method, the values of the ARD kernel (see Equation (1)) are transformed to $|\log \sigma_i^2|$ and then normalised to sum up to 1.

Figure 5 displays the weights of the features extracted from the Moodle log-file, as calculated by the models used in this study (see Table 1). The most relevant features for the inspected models are mostly related to the activity of the students on the Moodle platform, particularly downloading and resolving tasks and quizzes, which quantify the continuous evaluation of students. The most balanced features among the methods coincide with the first and third most relevant features, numbers 38 and 22. It is worth noting that the GP considers the second most relevant feature, number 14, which is related to system maintenance tasks, to be the least relevant. The least weighted features are 3, 17, and 36, which refer to Badge listing viewed, Group updated, and User list viewed, respectively. These features are related to student inspection of lists about the course, such as seeing the rest of the students. It seems reasonable that inspecting lists does not affect the final student mark as much as interacting with the continuous evaluation tasks mentioned above. Among the proposed methods, only the LASSO model enforces sparsity on the model weights through the minimisation of the L1-norm. As can be seen visually, the LASSO bars are usually smaller than the bars of other methods and take values of zero or close to zero in most cases.

Figure 6 displays the weights of the features that are related to the socio-economic survey, as calculated by the models used in this study. Notably, the top five most relevant features (i.e., those with the highest bars) are features 16, 9, 4, 18, and 15, respectively. These numbers correspond to extra paid classes, father's job, family size, and family education support, as listed in Table 2. It is interesting to observe that both extra academic support and family education support are deemed relevant by the models. Additionally, variables related to family, such as the father's job and family size, appear to influence the model's predictions. While the study time variable (number 13) did not make it to the top five, it still received a relatively high score in terms of its relevance.

Subject's Marks
In each course, there is a categorisation of marks that ultimately contribute to the final grade earned. These marks have been presented in Table 3. Our aim is to identify the most significant source of marks for the models under consideration. Figure 7 depicts the weights attained by the models. The Practical marks feature attained the highest weight, followed by the Video marks and the Seminar marks, corresponding to features 5, 3, and 4, respectively. Notably, the GP method attributed less weight to the Practical marks and more to the Video marks and Voluntary Homework marks.
As a concluding remark to this experiment, analysing the models' weights provides valuable insights into the factors that have the greatest impact on students' final marks. This quantitative approach can be useful in comparing different methodologies and evaluation strategies and in designing and validating them. Thus, the present experiment has shed light on the importance of different parts of the methodology in determining students' marks and has highlighted the significance of factors such as practicals, videos, and seminars in this regard.
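The weight normalisation used throughout the feature ranking analysis can be sketched as follows: absolute values normalised to sum to 1 for the linear models, and the $|\log \sigma_i^2|$ transform for the GP. The function names and the example values are hypothetical.

```python
import math

def normalised_abs_weights(weights):
    """Absolute weights rescaled to sum to 1 (as used for PLS, LASSO and RR)."""
    total = sum(abs(w) for w in weights)
    if total == 0.0:
        raise ValueError("all weights are zero; the ranking is undefined")
    return [abs(w) / total for w in weights]

def gp_ard_importance(sigma2):
    """|log sigma_i^2| for each ARD parameter, rescaled to sum to 1."""
    return normalised_abs_weights([math.log(s) for s in sigma2])

linear_importance = normalised_abs_weights([2.0, -1.0, 0.0, 1.0])
gp_importance = gp_ard_importance([math.e, 1.0, 1.0 / math.e])
```

Because of the absolute value, a parameter $\sigma_i^2 = 1$ receives zero importance, while values far from 1 in either direction receive high importance; this is a property of the transform to keep in mind when comparing the GP ranking with the linear models.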

Student Performance Analysis
This section provides a comprehensive comparison of the various methods and data sets used in the study to evaluate student performance. We evaluate the performance of four machine learning methods: Gaussian Processes (GP), Partial Least Squares (PLS), LASSO, and Ridge Regression (RR). The metric chosen to measure the accuracy is the Root Mean Squared Error (RMSE), which indicates the averaged error between the estimated student marks $\hat{y}$ and the true values $y$ according to

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}.$$

The Coefficient of Variation (CV) is a measure of dispersion of a frequency distribution; in our case, we compute $\mathrm{CV} = \sigma / \mu$ as the quotient between the standard deviation and the mean of the achieved RMSE.

Figure 8 displays the box plots of the proposed methods (GP, PLS, LASSO, and RR) across the three considered data sets. This information is presented in the first row of the figure. The median value of the GP model, denoted with the red line, achieves the best result in all three data sets. The second row of the figure shows the scatter plots of the proposed methods. In order to further evaluate the performance of each method and data set, the correlation values are presented in Table 4.

Table 4 presents various metrics and statistics on the performance of the proposed methods in the student performance task across the three different scenarios: the Moodle LMS data set, the socio-economic survey, and the marks obtained in the mathematics subject. The first row-block displays the Root Mean Squared Error (RMSE) of the different methods, where the best results are shown in bold and belong to the GP and RR methods. It is worth noting that GP is a nonlinear method that can fit the data to more complex relations between inputs and outputs, in our case, between features and marks. Ridge Regression (RR), on the other hand, is a regularised version of the classical least squares, which leads to good results when dealing with multicollinearity relations.
In the second row-block, the results achieved for the Normalized RMSE (NRMSE) are shown, where $\mathrm{NRMSE} = \mathrm{RMSE} / (y_{\max} - y_{\min})$. This measure provides a percentage interpretation of the result, and the best results, highlighted in bold, were achieved by the GP model and Ridge Regression (RR). The third row-block contains results about the Coefficient of Variation (CV), which is the quotient between the standard deviation and the mean of the errors. It quantifies the ratio of dispersion in the achieved results and expresses the precision and repeatability of an experiment. Typically, a value of CV > 1 is considered as representing high-variance distributions. In our results, in the subject marks data set, values of CV > 1 are marked in bold. Finally, the Pearson correlation coefficient $\rho$ reveals the same conclusions as the RMSE but in a bounded range of $\rho \in [-1, 1]$. The best results are presented in bold, and they can be compared with state-of-the-art studies presented in the literature [5,6].

Table 4, Pearson correlation block: coefficient $\rho$ with p-value in parentheses, one column per method.

Moodle            0.33 (2·10^-1)    0.25 (3·10^-1)    0.30 (2·10^-1)    0.30 (2·10^-1)
Socio-Economic    0.24 (3·10^-1)   −0.06 (8·10^-1)    0.07 (8·10^-1)    0.42 (8·10^-2)
Subject marks     0.22 (4·10^-1)    0.74 (5·10^-4)    0.77 (2·10^-4)    0.81 (5·10^-5)
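The three error metrics defined in this section can be computed as in the sketch below. The function names and the example marks are hypothetical, and the population standard deviation (dividing by $n$) is an assumption, since the text does not specify which estimator was used for the CV.

```python
def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and estimated marks."""
    n = len(y_true)
    return (sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the true marks, y_max - y_min."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population standard deviation assumed)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return std / mean

# Hypothetical marks on a 0-10 scale.
y_true = [2.0, 4.0, 6.0, 10.0]
y_pred = [3.0, 5.0, 5.0, 9.0]
error = rmse(y_true, y_pred)
relative_error = nrmse(y_true, y_pred)
cv = coefficient_of_variation([1.0, 3.0])
```

Here every prediction is off by exactly one mark, so the RMSE is 1 and the NRMSE is 1/8 of the observed range of marks.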

Conclusions from the Presented Study
In this study, we have presented a comprehensive comparison of different sources of information on students' behaviour and its components during a computer science course in the mathematics degree. Student performance is a critical area in education that requires a rich set of relevant features to build specific models that can describe this task. Our study provides a broader comparison of different sources of information commonly employed in the related literature but rarely reviewed together.
We conducted an exhaustive comparison of the data sets using four state-of-the-art machine learning and artificial intelligence methodologies. The inspection of model weights allowed us to gain insights into which features are more relevant in determining students' final marks. The results obtained in the student performance task are comparable with other state-of-the-art methodologies published in the educational area, indicating the effectiveness of the proposed approach.

Limitations and Future Work
The main limitation of our work lies in the number of academic years covered by the study. More robust conclusions could be obtained from a study spanning a longer timeline, which would also allow the benefits of the approach to be explored further. In addition, the present study was centred on a single university, a point which could be addressed in future work.
In future research, one promising direction is to explore the combination of multiple sources of information to build more accurate models that can capture the complex interplay of various factors influencing student performance. This may involve developing novel feature-engineering techniques or adopting more advanced machine learning algorithms to leverage the heterogeneity and complementarity of different data sources. Furthermore, extending the comparative study to include additional courses or educational contexts could help generalise the findings and uncover new patterns and challenges in the prediction of student performance. Such investigations could also contribute to the development of more generalised and transferable models that can be applied to diverse educational settings.