Future Generation Computer Systems

Due to the COVID-19 pandemic, more higher-level education programmes have moved to online channels, raising issues in monitoring students' learning progress. Thanks to advances in online learning systems, however, student data can be automatically collected and used for the investigation and prediction of the students' learning performance. In this article, we present a novel approach to analyse students' learning behaviour, as well as the relationship between these behaviours and learning assessment results, in the context of programming education. A bespoke method has been built based on a combination of Random Matrix Theory, a Community Detection algorithm and statistical hypothesis tests. The datasets contain fine-grained information about students' learning behaviours in two programming courses over two academic years with about 400 first-year students in a Medium-sized Metropolitan University in Dublin. The proposed method is a novel approach to data preprocessing which can improve the analysis and prediction based on learning behavioural datasets. The proposed approach deals with the issues of noise and trend effect in the data and has shown its success in detecting groups of students who have similar learning behaviours and outcomes. The higher performing groups have been found to be more active in practical-related activities throughout the course. Conversely, we found that the lower performing groups engage more with lecture notes instead of doing programming tasks. The learning behaviours data can also be used to predict students' outcomes (i.e. Pass or Fail the terminal exams) at the early stages of the study, using popular machine learning classification techniques.


Introduction
Education in Computer Programming and related domains has received increasing attention due to the growth in demand for Information Technology (IT)-related job markets. Furthermore, in recent years, STEM fields (science, technology, engineering and mathematics) also require essential IT skills and knowledge, making these types of skills an integral part of most STEM subdisciplines such as Artificial Intelligence, Bio-informatics, Statistics etc. A set of programming courses is pivotal to any IT-related degree, so understanding and improving students' engagement and learning process are of key importance. However, despite the necessity of these skills, considerable drop-out rates in introductory programming courses have been reported in many studies [1]. According to a recent study using data from 161 universities around the world, the failure rate in introductory programming modules is 28% on average, with a huge variation from 0% to 91% [1]. The amount of educational data collected during the learning process also has the potential to help instructors and students to obtain a comprehensive view of a student's learning progress. This insight enables the possibility of evidence-based interventions and recommendations [5] which might have effects on learner perception, learning patterns and learning outcomes [6].
In terms of automatic behavioural log data, there is the potential for the effect of noise and trend to be present in the automatically collected data. Students can work flexibly when completing their learning paths in the online learning system. For example, they can carry out various learning activities such as reading lecture notes, coding, navigating among course documents in any order, resulting in noise in the logged data, i.e. data heterogeneity and complexity [7]. In addition, we have noticed from the data gathered that students are likely given the same instructions and learning pathway in the same class. As a consequence, this may create a trend effect, i.e. students' learning behaviours can be similar and highly positively correlated with other learners' behaviours in the same course. Hence, it is important to filter noise and clean the trend effect in the event log data before applying further analysis.
This research aims to investigate the relationship between students' learning behaviours on course material items and their performance in the exams while taking programming-related courses in online learning systems. Specifically, the research objective is to answer these research questions:
• RQ1. Do students from different groups, corresponding to different patterns of learning behaviours, perform differently in the exams? If this is the case, how do such groups' interactions with items of course material differ?
• RQ2. Is there potential for students' learning behavioural data to be used to predict learning outcomes (Pass or Fail) at the early stages of the study period?
To address the research questions, we investigate about 400 university students participating in the two programming-related courses during the two academic years (2017/2018 and 2018/2019). The courses have been delivered to students in a combination of conventional and online formats. In particular, students have physically attended the lecture sessions in lecture halls, and have conducted all learning activities on a bespoke online system. The learning data were logged and these serve as input datasets for further analysis here. Behavioural data captured automatically from the system is stored in the format of an event log. From the input event logs, the concepts of a student-event item data matrix and a transition-student data matrix (described below) have been developed to represent the students' learning behaviour. To deal with the problem of noise and trend effect in the datasets, we utilise cleaning methods based on Random Matrix Theory (RMT), followed by the construction of Minimum Spanning Trees (MST) to reflect the difference in learning behaviours of all students. Community detection algorithms and statistical tests have also been applied to investigate the students' behaviours on course material items. For the prediction of learning outcome, a set of machine learning algorithms has been applied to every week's original and cleaned data and the predictability has been validated using cross-validation.
The rest of the paper is organised as follows: Section 2 discusses the related works; Section 3 describes the context of the study, data and methods; Section 4 provides detail of the experimental results; Section 5 discusses the implications and limitations, followed by the conclusion in Section 6.

Analysis using learning behavioural data
Much research has been carried out to determine the relationship between the learning behaviours and performance of students [8,9]. In [10], the authors investigated a variety of learning activities such as collaborative activities and giving feedback by using data from 13 participants in an experimental setting class. The effect of the diversity of learning styles on learning scores and satisfaction has also been tested in [11], using the data from an online forum and survey data from 144 students. Although these efforts consider a wide range of learning activities, they have been carried out with small size samples so survey data was still required for the analysis. In the context of this paper, we utilised datasets from a large number of students in two modules over the two academic years, i.e. 112 students in Course#1 2018, 151 students in Course#1 2019, 62 students in Course#2 2018 and 48 students in Course#2 2019. The datasets were automatically collected during the study from our bespoke online learning platform.
Complementing the work above, the analysis of massive learning behavioural log data has been supported with the emergence of Educational Process Mining (EPM), and the application of process mining [12] techniques and algorithms in educational event log data [13]. Recently, EPM appears to be an effective tool to analyse educational data and deliver new insights into the learning and teaching processes. Many of the process mining applications in education have been developed and implemented in various aspects of education [14,15]. The applications of Massive Open Online Courses (MOOCs) attract the most attention from researchers due to the availability of the input log data [7]. The majority of applications in EPM aim to discover learning patterns from the input data so-called event log, resulting in learning process models. However, there may be challenges for the process discovery approach when there are many complicated and noisy event logs. In such cases, the process mining techniques would likely produce 'spaghetti-like' process models which can be incomprehensible. It is also not trivial to combine many process models for the interpretation [16]. This research adopts the notion of 'event-log' in EPM as a storage format of the collected dataset. However, instead of generating complex process models, we extract features from the logs and then propose a method to clean the extracted dataset. We suggest that this cleaning method can separate the information part from the noise in the dataset, and, in the process, improve the performance of community analysis, i.e. to produce more logical coherent communities in terms of their learning performance, as well as to generate better predictive models of learning outcomes.
In terms of the network-based approach in Education, there are a few existing works using Community Analysis and Minimum Spanning Trees, which are most similar to this paper. Both [17] and [18] studied the network structure of undergraduate courses and their contributions to students' learning pathways. However, the authors merely considered the courses' grades from a relatively small number of students. We not only utilise exam results but also consider student learning behaviours by extracting behavioural data features from large, automatically collected datasets.
Regarding programming educational context, in [19], the author found that practice is essential for improving students' programming skills and students should be given opportunities to practice and receive constructive feedback. In [20], the author developed metrics for use as formative assessment tools to analyse (successful and unsuccessful) students' learning patterns. However, these approaches have focused mostly on practical activities such as coding and solving programming tasks. Our approach considers comprehensive learning activities that students do in their programming study, i.e. coding, reading lecture notes and labsheets.

Learning outcome prediction
Prediction of student performance has been one of the most popular topics in Learning Analytics in recent years [4,21]. In general, input data used for the prediction can be classified into two categories: static and dynamic data [6]. Student demographic information and historical educational records can be classified as static data because these variables and values do not update or change frequently over the study period. On the other hand, online behaviours, textual data and other multimodal data can be considered as dynamic data because they can be continuously generated when students are interacting with the system. In terms of the prediction of learning outcome from the static data, one can rely on features such as personal attributes and cumulative grade point average (CGPA) from previous years [22]. For instance, using learning grades from previous courses can predict the drop-out probabilities of computing students [23]. However, the use of static data to predict learning outcomes has been shown to cause some problems [6], i.e. the student's actual efforts during the learning process can be ignored. It has also been found that previous student results (e.g. the CGPA) are not sufficient to predict drop-outs, and engagement variables also need to be included (e.g., number of accesses to the platform) to achieve good accuracy results [24]. The static data can also be difficult to collect as they may need to be merged from various data sources, which might cause data quality and ethical issues.
On the other hand, dynamic features such as behavioural data can be collected easily due to their availability in advanced learning platforms. It has been shown that it is possible to rely on dynamic data [25] to predict student learning outcomes. For example, learning log data from the Moodle platform has been used for predicting learners' performance [26], using common features in a Learning Management System such as Assignment, Feedback, Course login and Chat. More fine-grained data can be used as predictors, such as mouse interactions (e.g. click and drag) [27]. Multi-modal features (e.g. eye-tracking, face-video and wristband) have also been shown to be predictive of learning performance [28]. Generally, according to recent surveys, most of the current approaches focus on either combining new features collected from learning platforms or new strategies with different machine learning predictive models [6,21]. However, to the best of our knowledge, none of the existing research directly deals with the issues of noise and trend in educational data, which, we feel, may have a negative influence on prediction models. One of the most common methods to pre-process data is Principal Component Analysis (PCA), which has been applied in different areas such as education [29], medicine [30] and network security [31]. Although PCA supports the selection of the most relevant features, which may incidentally eliminate noise in the data, the trend effect remains. The evaluation of the predictability of the behavioural data at early stages also remains limited [6]. Based on Random Matrix Theory, our approach aims to identify and separate the key information part from the noise, which enhances the performance of the prediction models with cleaned datasets in comparison with original and PCA-based processed datasets.

Context of the study
This research has been carried out based on four datasets representing the learning behaviour of students and their performance in two programming-related courses on a Computer Science (CS) programme in a Medium-sized Metropolitan University. The first course is a first-year introductory programming module that is delivered to Software Engineering students. These students generally have the aim of targeting programming-related jobs such as software development. The second course is a programming module taken by first-year Business Computing students who are usually looking for non-technological positions in an IT-related field. We denote the two courses as Course#1 and Course#2, respectively.
In both courses, learning material items are provided to the students on a weekly basis. Course items include general course information, lecture notes, labsheets and programming tasks. Students should read lecture notes during a lecturing session. In a lab session, students should follow instructions and examples in labsheets and do given programming tasks. The solutions to the tasks are uploaded and tested automatically by the system. Course items are delivered in the form of web pages on the bespoke online learning system. We formalise the course material items in this context as material type (i.e. General, Lecture, Labsheet and Practice) combined with the corresponding week, e.g. Labsheet_1 means the labsheet used in week 1. For the general documents, we denote them as General. Students' interactions with the items (e.g. mouse clicking or scrolling, highlight a piece of text or switching between two items) are logged automatically on the database.
There are three lab exams in Course#1, which take place in weeks 4, 8 and 12 (the final week of the semester), while Course#2 students have to take two lab exams in weeks 6 and 12. On finishing a programming lab examination task, students submit their code to the system and get the result as a ''correct'' or ''incorrect'' submission. A submission is considered ''correct'' if it passes all the test cases which are pre-defined by the instructors. Each task is given the same mark proportion and the overall mark is given to students after the exam is finished. A student whose grade is lower than 40 out of 100 is labelled as ''lower-performing''; otherwise, that student is considered ''higher-performing''. In this research, the overall marks of students have been used for the behavioural analysis while the labelling is used as a target variable, i.e. ''higher-performing = 1'' and ''lower-performing = 0'', for the evaluation of the predictability of behavioural data to the students' learning outcomes.
All lab exams are mandatory and carry the same weight in the overall assessment. Students are, therefore, required to take the exams seriously by doing all given programming tasks as much as they can. The last exam in each module is the most challenging one, requiring a comprehensive understanding of the course knowledge to solve the given problems. Therefore, the results of the last exam will be used as a basis for further analysis in this research.
The two courses are expected to provide students with fundamental knowledge and skills in Python programming. Both modules are mandatory and key to the aims of the overall outcomes. Because they are prerequisites for the programmes, students are expected to be equally motivated, as they cannot follow the curriculum without a deep understanding of these modules. Hence, we assume that students are likely to take these modules seriously and participate fully to maximise their learning benefits.
The difference between the two modules relates to the level of knowledge and task requirements. In Course#1, students are  taught more advanced concepts in programming and given more challenging exercises, compared to students in Course#2, as that relates to their specific programmes. As a result, students in Course#1 generally have more activities in learning than Course#2 students, as can be seen in Table 1. In other words, while Course#1 can be seen as a typical programming course for CS students, Course#2 represents a programming course for non-IT learners who still need programming skills at a certain level. It is important to note that both courses have the same coordinator and the curriculum had not significantly changed over the two academic years. Therefore, the learning motivation of students is expected to be the same in both courses. However, their behaviours can be distinct due to the differences in the level of course requirements. As a result, these datasets can reflect the diversity of learning characteristics of students' learning behaviours, giving a good quality of data.

Event logs for learning behaviours
We can consider a real-life scenario of student learning programming on the online learning system as follows. On a day in Week 5, student s1 read a labsheet for a task instruction. While reading a labsheet, the student also switched between lecture notes and the labsheet two times, and another two mouse events on the lecture page were logged. Then the student can write code to solve a given task and upload it to the system via the submission portal. All these learning events of the student s1 can be recorded and stored as event data structure so-called event log which can be seen in Table 2 as an example.
We adopt the format of event log in Process Mining [12] to store the students' learning behaviour. An event log includes a collection of events recorded in chronological order. Each event belongs to a learning trace which refers to the sequence of events of a student within a day. Event logs may contain other attributes such as timestamps, participants and results. In the context of this research, a student's learning event log comprises the following information:
• Trace id: A trace refers to a sequence of learning events of a student in a day. For example, Table 2 illustrates two learning traces associated with 12 August and 13 August 2018 of the student s1.
• Event Item: An event item refers to an item of course material of the corresponding week where students' interactions with the system are logged.
• Timestamps: Timestamps refer to the date and time when the corresponding event occurred. The timestamp is essential information as it will be used for ordering events and reflecting the behaviour of students.
• Student id: Student id refers to the identity of students.

Learning behavioural features
We use two types of features to reflect the students' learning behaviour in this research. The first type is the event item features, i.e. the number of events that occurred in each course material item. Event frequency features extracted from the event log can be arranged as a student-event item data matrix where each column refers to the number of events on a material item generated by students and each row is the data for each student. An example of a student-event item data matrix can be seen in Table 3.
The second type is the transition frequency features, i.e. the number of occurrences that a student moves from an event on a course item to another event. Please note that the two events can be on the same item or two different items. We use the term transition to denote this phenomenon of moving between consecutive events. The transition frequency features can be arranged as a transition-student data matrix where rows refer to transition frequency features and columns are the data for students. An example of a transition data matrix of an event log can be seen in Table 4. The value of Lecture1-Labsheet1 for student s2 equal to 14 indicates that student s2 performed an event 14 times on the Lecture1 directly before the next event on Labsheet1. Please note that if the two materials are the same, e.g. Lecture1-Lecture1, the transition reflects a loop in the learning process, i.e. the student keeps working on the same course item Lecture1.
In this paper, the student-event item data matrix has been used for the learning outcome predictions and the transition-student data matrix has been used for analysing the relationship between students' learning behaviour and their assessment performances.
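The two feature types described above can be illustrated with a minimal sketch. The event log below is hypothetical (it is not taken from Table 2), the tuple layout and function names are assumptions made for illustration, and the choice to count transitions only within a single trace (one student, one day) is also an assumption rather than something stated in the text.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (trace_id, student_id, event_item, timestamp).
EVENT_LOG = [
    ("t1", "s1", "Labsheet_5", "2018-08-12 10:00:00"),
    ("t1", "s1", "Lecture_5", "2018-08-12 10:02:00"),
    ("t1", "s1", "Lecture_5", "2018-08-12 10:03:00"),
    ("t1", "s1", "Labsheet_5", "2018-08-12 10:05:00"),
    ("t1", "s1", "Practice_5", "2018-08-12 10:20:00"),
    ("t2", "s1", "Lecture_5", "2018-08-13 09:00:00"),
    ("t3", "s2", "Lecture_5", "2018-08-12 11:00:00"),
    ("t3", "s2", "Labsheet_5", "2018-08-12 11:01:00"),
]


def event_item_counts(log):
    """Return {student: Counter(item -> number of events on that item)}."""
    counts = defaultdict(Counter)
    for _trace, student, item, _timestamp in log:
        counts[student][item] += 1
    return counts


def transition_counts(log):
    """Return {student: Counter("A-B" -> number of moves from item A to item B)}.

    Assumption: only consecutive events inside the same trace form a transition.
    """
    traces = defaultdict(list)
    for trace, student, item, timestamp in log:
        traces[(student, trace)].append((timestamp, item))

    counts = defaultdict(Counter)
    for (student, _trace), events in traces.items():
        # ISO-formatted timestamps sort chronologically as plain strings.
        items = [item for _timestamp, item in sorted(events)]
        for source, target in zip(items, items[1:]):
            counts[student][f"{source}-{target}"] += 1
    return counts
```

Each inner counter corresponds to one student's row of the student-event item data matrix, or one student's column of the transition-student data matrix; a transition such as Lecture_5-Lecture_5 is the loop case mentioned above.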

PCA and random matrix theory for behavioural features
Given an m × n data matrix G extracted from an event log, we can normalise the matrix G as $G^{(n)}$ as follows:
$$G^{(n)}_j = \frac{G_j - \bar{G}_j}{\sigma_j}$$
where $G^{(n)}_j$ is a column of the matrix $G^{(n)}$ and $G_j$ is a column of the matrix G. If G is a transition-student data matrix, $G_j$ denotes the series of transition frequencies of student j. On the other hand, if G is a student-event item data matrix, $G_j$ denotes the series of access frequencies of the learning item j across all students. $\bar{G}_j$ is the mean value of $G_j$ and $\sigma_j$ is the standard deviation of $G_j$. In other words, in the transition-student case, $G_j$ and $G^{(n)}_j$ reflect the learning behaviour of student j. The correlation matrix C can be expressed in terms of the inner product of $G^{(n)}_i$ and $G^{(n)}_j$ as follows:
$$C_{ij} = \langle G^{(n)}_i \, G^{(n)}_j \rangle$$
where $\langle \cdot \rangle$ denotes the average over the m entries of the two columns. It may be noticed that the correlation $C_{ij}$ can reflect how similarly two students i and j interacted with course material items. If $C_{ij} > 0$, the transitions of the two students i and j increased together and the students behaved similarly in the course. Conversely, if $C_{ij} < 0$, the two students tend to behave differently on the learning system. The eigenvalue equation of C is given by $CV = V\Lambda$, where $\Lambda$ is an n × n diagonal matrix of eigenvalues $\lambda_i$ and V is a matrix whose columns are the corresponding eigenvectors $v_i$ of C.
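As a minimal sketch of the normalisation and correlation steps, assuming NumPy and a population standard deviation (the text does not say which estimator is used), the computation could look as follows. The function names are illustrative only.

```python
import numpy as np


def normalise_columns(G):
    """Centre each column of G and scale it to unit variance.

    Assumption: the population standard deviation (ddof=0) is used.
    A column with zero variance would cause a division by zero here.
    """
    G = np.asarray(G, dtype=float)
    return (G - G.mean(axis=0)) / G.std(axis=0)


def correlation_matrix(G):
    """Return the n x n correlation matrix of the columns of an m x n matrix."""
    Gn = normalise_columns(G)
    m = Gn.shape[0]
    return Gn.T @ Gn / m


def eigen_decomposition(C):
    """Return eigenvalues (largest first) and matching eigenvectors as columns."""
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```

With this convention the result coincides with the Pearson correlation of the columns, and the returned eigenvectors satisfy the eigenvalue equation of C column by column.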
Based on PCA theory, the n new variables $x_i$, forming a new data matrix $X = [x_1, x_2, \ldots, x_n]$, can be obtained after Principal Component Analysis of $G^{(n)}$ as follows:
$$x_i = G^{(n)} v_i$$
where $1 \le i \le n$, $x_i$ refers to the scores and $v_i$ refers to the loadings of the principal component i. In other words, each principal component has its own eigenvalue and eigenvector. We can also reconstruct the original normalised data $G^{(n)}$ from X as follows:
$$G^{(n)} = X V^{T} = \sum_{i=1}^{n} x_i v_i^{T}$$
In addition, consider a random matrix A, an m × n matrix with randomly distributed elements with zero mean and unit variance. It has been shown [32] that the properties of C can be compared to those of the correlation matrix R of the random matrix A, $R = \frac{1}{m} A^{T} A$. According to RMT, the statistical properties of such a matrix R are known [33]. In particular, when the sample size $m \to \infty$ and the number of features $n \to \infty$, provided that $Q = m/n \ge 1$ is fixed, the distribution of eigenvalues $\lambda$ of the random matrix R is given by the Marcenko-Pastur probability density function [34]:
$$P(\lambda) = \frac{Q}{2\pi\sigma^{2}} \, \frac{\sqrt{(\lambda_{+} - \lambda)(\lambda - \lambda_{-})}}{\lambda}$$
where $\lambda_{-} \le \lambda \le \lambda_{+}$, and $\lambda_{-}$ and $\lambda_{+}$ are the lower and upper limits of the eigenvalues of R respectively, given by:
$$\lambda_{\pm} = \sigma^{2}\left(1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}\right)$$
where $\sigma = 1$ due to A having unit variance. We note that $\lambda_{\pm}$ are the upper/lower limits of the theoretical eigenvalue distribution. Eigenvalues falling outside of this range are assumed to deviate from the expected values of the Random Matrix Theory [35]. As a result, by comparing this theoretical distribution with the empirical data, we can identify the key eigenvalues containing specific information in the data. This characteristic of the RMT supports the need to clean the effect of noise and trend in the data [34].
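The Marcenko-Pastur limits and density can be evaluated directly from m, n and σ. The sketch below assumes NumPy and the standard form of the density for Q = m/n ≥ 1; the function names are hypothetical, and the final helper simply selects the eigenvalues above the upper limit, which the text treats as the ones carrying information.

```python
import numpy as np


def marchenko_pastur_bounds(m, n, sigma=1.0):
    """Return (lambda_minus, lambda_plus) for Q = m / n >= 1."""
    inv_q = n / m
    lam_minus = sigma**2 * (1.0 + inv_q - 2.0 * np.sqrt(inv_q))
    lam_plus = sigma**2 * (1.0 + inv_q + 2.0 * np.sqrt(inv_q))
    return lam_minus, lam_plus


def marchenko_pastur_pdf(lam, m, n, sigma=1.0):
    """Evaluate the density at lam; it is zero outside the two limits."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    lam_minus, lam_plus = marchenko_pastur_bounds(m, n, sigma)
    q = m / n
    pdf = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    x = lam[inside]
    pdf[inside] = (
        q / (2.0 * np.pi * sigma**2) * np.sqrt((lam_plus - x) * (x - lam_minus)) / x
    )
    return pdf


def informative_eigenvalues(eigvals, m, n, sigma=1.0):
    """Return the eigenvalues lying above the upper Marcenko-Pastur limit."""
    _, lam_plus = marchenko_pastur_bounds(m, n, sigma)
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals[eigvals > lam_plus]
```

For instance, with m = 400 and n = 100 (so Q = 4) the limits evaluate to 0.25 and 2.25, and any empirical eigenvalue above 2.25 would be flagged as informative.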

Noise and trend effect cleaning
We have noticed that, in practical usage of the online learning system, students may interact flexibly with course material items. Although the students are given the same instructions and learning pathway, they are free to use learning functions in their own way. This phenomenon appears to create noise in the event log data. On the other hand, as all students attended the same lectures, the learning instructions given to them are the same. As a consequence, all students may interact similarly with course material items. We can observe this trend effect in Fig. 1: most transitions among students are highly positively correlated. This issue may limit the chance of detecting the difference in learning behaviours among groups of students. Therefore, it is necessary to clean the effect of noise and trend in the dataset [34]. By removing such a trend component in a classroom, the remaining components of the correlation could better explain the characteristics of the students' learning behaviours. In this paper, we adopt, from financial references such as [36,37], the concept of a ''Market Component''. This is the largest eigenvalue of a correlation matrix, representing a cross-market effect affecting all stocks. Similarly, the trend effect in a classroom can be reflected by the largest eigenvalue of the correlation matrix of students' learning behaviours.
In the following sub-sections, we discuss the methods to clean the correlation matrix of a dataset as well as propose a method to clean the dataset based on Random Matrix Theory.

Cleaning the correlation matrix
Having reviewed a number of correlation cleaning methods (e.g. eigenvalue clipping [35,37] and linear shrinkage [38]), we utilise eigenvalue clipping because it was found to be the best in terms of its ability to remove the noise while preserving the information part, i.e. the trace of the original correlation matrix. It does so by simply utilising the results of the Marcenko-Pastur equation [39], instead of requiring a parameter to be chosen during the cleaning process, as is the case for linear shrinkage and rotationally invariant, optimal shrinkage [40]. Eigenvalue clipping provides robust out-of-sample performance [41] and has also been widely adopted [42,43].
Let $\lambda_1, \ldots, \lambda_N$ be the set of all eigenvalues of C with $\lambda_1 > \cdots > \lambda_N$, and let i be the position of the eigenvalue such that $\lambda_i > \lambda_{+}$ and $\lambda_{i+1} \le \lambda_{+}$. Then we set
$$\lambda_j = \frac{1}{N - i} \sum_{k=i+1}^{N} \lambda_k$$
where $j = i + 1, \ldots, N$. In other words, we keep all eigenvalues above the upper bound, i.e. those with information, and replace all eigenvalues at or below the upper bound, i.e. those within the bounds predicted by RMT, with their average value. Hence, this method preserves the trace of the original correlation matrix. The new set of eigenvalues can be used to construct a denoised eigenvalue spectrum and the associated correlation matrix $C_{denoised}$ [34].
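Eigenvalue clipping as described above can be sketched in a few lines of NumPy. The function name is hypothetical, and rebuilding the matrix from the unchanged eigenvectors with the clipped eigenvalues is an assumption about how the denoised matrix is assembled; no rescaling of the diagonal is applied here.

```python
import numpy as np


def clip_eigenvalues(C, lam_plus):
    """Replace every eigenvalue <= lam_plus by the mean of those eigenvalues.

    Returns the denoised matrix and the clipped eigenvalues (largest first).
    """
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending order
    order = np.argsort(eigvals)[::-1]     # reorder so that lambda_1 is largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    noise = eigvals <= lam_plus
    clipped = eigvals.copy()
    if noise.any():
        # Averaging keeps the sum of eigenvalues, and hence the trace, unchanged.
        clipped[noise] = eigvals[noise].mean()

    C_denoised = eigvecs @ np.diag(clipped) @ eigvecs.T
    return C_denoised, clipped
```

Because the replaced eigenvalues are set to their own average, the trace of the result equals the trace of the input, which is the property the text relies on.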
The effect of the first eigenvalue and eigenvector can be removed from the denoised correlation matrix as follows [34], forming a cleaned correlation matrix:
$$C_{cleaned} = C_{denoised} - W_1 V_1 W_1^{T}$$
where $W_1$ and $V_1$ are the first eigenvector and eigenvalue of C. An example of the effect of the cleaned correlation matrix can be seen in Fig. 1 and Fig. 2. The two figures illustrate the correlation coefficients of the transition-student data matrix in the Course#1-2018 dataset, i.e. each dot in the figures refers to the correlation of one student to another student. The scale on the right side of the two figures indicates the range of the correlation coefficients. We can notice that the dots on the diagonal refer to the correlation of the transition data of a student to her/himself (i.e. correlation values = 1). It may be seen in Fig. 1 that the majority of the dots are in different shades of green. This phenomenon may reflect a ''trend effect'', i.e. students' learning behaviours can be similar and highly positively correlated with other learners' behaviours in the same class. These issues may negatively influence the construction of prediction models. After cleaning the data, there are more neutral dots and dots in shades of orange visible in Fig. 2, indicating negative correlation values. That is to say, the $C_{cleaned}$ correlation matrix may contain differences in student learning behaviours, creating more chances to better cluster the students. Similar results are observed in the other three datasets (Course#1-2019, Course#2-2018 and Course#2-2019).
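Removing the first eigen-component amounts to subtracting a rank-one matrix. The following is a minimal sketch, assuming NumPy; the function name is hypothetical and no rescaling of the diagonal is applied afterwards.

```python
import numpy as np


def remove_first_component(C_denoised):
    """Subtract the contribution of the largest eigenvalue and its eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(C_denoised)  # ascending order
    largest_eigenvalue = eigvals[-1]
    first_eigenvector = eigvecs[:, -1]
    trend = largest_eigenvalue * np.outer(first_eigenvector, first_eigenvector)
    return C_denoised - trend
```

As a small worked case, a 4 × 4 matrix with ones on the diagonal and 0.8 everywhere else has largest eigenvalue 3.4 with a constant eigenvector; after the subtraction every off-diagonal entry becomes -0.05, which illustrates how negative values can appear once the common trend is removed.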
The use of cleaned correlation matrix in Community Analysis is discussed in Section 3.6.

Cleaning the dataset
While the cleaned correlation matrix is expected to be useful in community analysis, the prediction of learning outcomes requires a tabular dataset. As a result, to improve the prediction models, it is necessary to clean the original data matrix rather than the correlation matrix. In this section, we propose a method to clean the original data matrix based on Random Matrix Theory.
In terms of the eigenspectrum of the correlation matrix, let $\lambda_1, \ldots, \lambda_N$ be the set of all eigenvalues of C with $\lambda_1 \ge \cdots \ge \lambda_N$, and let k be the position of the eigenvalue such that $\lambda_k > \lambda_{+}$ and $\lambda_{k+1} < \lambda_{+}$. We note that $\lambda_1$ refers to the largest eigenvalue and the first principal component. The clean dataset $\hat{G}$ can be constructed as follows: 1. The cleaned dataset $\hat{G}$ can then be used as input for machine learning predictive models. We expect that the performance of the predictive models using $\hat{G}$, either fully or partly cleaned, will be improved in comparison with the use of the original dataset and simple PCA-based datasets. Results are shown in Section 4.2.1.
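The construction steps for the cleaned dataset are not reproduced in the text above, so the sketch below is only one plausible reading and should not be taken as the authors' procedure. It assumes that cleaning means rebuilding the normalised data from a subset of principal components: components whose eigenvalue does not exceed the upper Marcenko-Pastur limit are treated as noise, and the first component is treated as the trend. The function name and both flags are hypothetical.

```python
import numpy as np


def clean_dataset(G, remove_noise=True, remove_trend=True):
    """Rebuild the normalised data from the retained principal components.

    Assumption (not from the source): noise components are those with
    eigenvalue <= lambda_plus, and the trend is the first component.
    """
    G = np.asarray(G, dtype=float)
    m, n = G.shape
    Gn = (G - G.mean(axis=0)) / G.std(axis=0)
    C = Gn.T @ Gn / m

    eigvals, V = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, V = eigvals[order], V[:, order]

    lam_plus = 1.0 + n / m + 2.0 * np.sqrt(n / m)  # sigma = 1, 1/Q = n/m

    keep = np.ones(n, dtype=bool)
    if remove_noise:
        keep &= eigvals > lam_plus
    if remove_trend:
        keep[0] = False

    X = Gn @ V  # principal component scores
    return X[:, keep] @ V[:, keep].T
```

Setting only one of the two flags would correspond to a partly cleaned dataset. Under this reading, if the first eigenvalue is the only one above the upper limit, full cleaning leaves an all-zero matrix, so the partly cleaned variants may be the more useful ones in that situation.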

Distance matrix of students learning behaviours
Although the correlation values appear to be useful in reflecting the similarity and difference in students' learning behaviours, they are not appropriate metrics as they do not satisfy the non-negativity and triangle inequality conditions [34]. For example, the difference between the correlations (0.8, 1.0) is the same as (0.1, 0.3), but the former tuple illustrates a higher difference regarding co-dependence. Fortunately, it is possible to translate the correlation matrix into a distance matrix D as follows [34]:
$$D_{ij} = \sqrt{\tfrac{1}{2}\left(1 - C_{ij}\right)}$$
with $D_{ij} \in [0, 1]$, where $D_{ij}$ is the distance between the learning behaviours of the two students i and j. A value closer to 1 indicates that the two students interact completely differently with the course material items, while a value closer to 0 indicates that the two students behave similarly. Fig. 3 shows an example of the learning behavioural distance matrix of students in the Course#1-2018 dataset. The diagonal comprises zero values, illustrating the behavioural distance between students and themselves. The other distance values range from 0 to more than 0.6. The distance matrix can be used to construct a community graph, which is discussed in the next section.
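The correlation-to-distance mapping is a one-line transformation. The sketch below assumes NumPy; the function name is hypothetical, and the clipping step is an added safeguard against round-off pushing values slightly outside [-1, 1], not something stated in the text.

```python
import numpy as np


def correlation_to_distance(C):
    """Map correlations in [-1, 1] to distances in [0, 1]."""
    C = np.clip(np.asarray(C, dtype=float), -1.0, 1.0)
    return np.sqrt(0.5 * (1.0 - C))
```

Applied to the example above, the gap between correlations 0.8 and 1.0 becomes a distance gap of about 0.32, whereas the gap between 0.1 and 0.3 becomes only about 0.08, so the distance reflects the stronger difference in co-dependence of the former pair.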

Community analysis
To verify whether students with similar behaviours perform similarly in lab exams, and vice versa, we adopt a network-based approach. A graph is constructed from the distance matrix: each node represents a student and the edge weight between two nodes indicates the distance between the learning behaviours of the two students. Then, a clustering technique can be applied to the graph to detect communities in which students with similar learning behaviours are grouped.

Construct graph from distance matrix
A graph can be constructed directly using the distance matrix values $D_{ij}$ as edge weights. Unfortunately, such a network is a weighted complete graph that is hard to read, as each node (student) is connected to all other nodes in the graph. Additionally, we note that the time complexity of community detection algorithms depends on the number of edges and nodes in the graph. For example, the time complexity of the Girvan-Newman algorithm [44] is $O(m^2 n)$, where m is the number of edges and n is the number of nodes (or students). With such a fully connected graph, the number of edges is $m = n(n-1)/2$, which leads, in the worst case, to a time complexity of $O(n^5)$. To overcome this issue, one possible solution is to reduce the number of edges in the fully connected graph. It is important to minimise the number of edges in the constructed graph while preserving the purpose of grouping students with similar behaviours.
In the context of this paper, the distance values between distinct pairs of students in each dataset are all different. In other words, all edge weights of the fully connected graph constructed from the corresponding distance matrix are unique. Taking advantage of this characteristic, we adopt the notion of a Minimum Spanning Tree (MST) [45], i.e. an MST is constructed for each graph and connects all students in a course without any loops. With the distance matrix D as the weighted adjacency matrix of a graph, the associated MST is the spanning tree whose total edge weight is minimal among all possible spanning trees. We note that if all edge weights of a graph are unique, then the graph has exactly one MST. Hence, in our case, each course dataset produces a single associated MST. The MST of a set of n students is a graph with n − 1 edges, reducing the time complexity of the Girvan-Newman algorithm to $O((n-1)^2 n) = O(n^3)$. Furthermore, in the MST of a whole course, each student is connected to one or more other students who have the most similar learning behaviours to that student. Therefore, the clustering purpose is preserved.
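To make the reduction from a complete graph to an MST concrete, the following sketch builds the tree with Kruskal's algorithm from a small distance matrix. The choice of Kruskal's algorithm, the function name and the four-student matrix are assumptions made for illustration; the text does not state which MST algorithm was used.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on the complete graph defined by a symmetric
    distance matrix; returns a list of (i, j, weight) with i < j."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving paths on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All n(n-1)/2 edges of the complete graph, lightest first
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate parts, so no loop is created
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Hypothetical distances between four students (all pairwise values distinct)
dist = [[0.00, 0.10, 0.40, 0.50],
        [0.10, 0.00, 0.30, 0.60],
        [0.40, 0.30, 0.00, 0.20],
        [0.50, 0.60, 0.20, 0.00]]
mst = minimum_spanning_tree(dist)
```

For these four students the complete graph has six edges, whereas the tree keeps only three, in line with the n − 1 edge count stated above.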

Community detection on MST graph
Based on the MST constructed from the distance matrix, the next step is community detection, for which several methods are available [44,46]. In this research, we utilise the popular Girvan-Newman algorithm [44], which has been applied in various domains such as biology [44], finance and cryptocurrencies [47]. The algorithm divides the whole network into smaller communities or groups by progressively removing the edge with the highest edge betweenness until no edges remain. The betweenness of an edge is the number of shortest paths between pairs of nodes in the network that run through that edge [44]. Please note that we take the weight of edges into account when calculating the edge betweenness. The nodes, i.e. students, in a detected group are more densely connected to each other than to those outside the group. Fig. 4 illustrates an example of the MST constructed from the data for Course#1-2018 in week 12.
The detected groups can be used for further investigation regarding their performance in lab exams. In particular, we can use statistical tests to verify whether the lab exam grades are significantly different between the communities. As it is not guaranteed that the data of students in each community are normally distributed, a non-parametric test is preferred in this case, and the Mann-Whitney U Test [48] has been used. It is also possible to verify whether two communities interacted differently with each course material item in the system. Further investigations are discussed in Section 4.
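The edge-removal procedure can be sketched for the special case of a tree as follows. Because the input here is a tree, each pair of nodes is joined by exactly one path, so the sketch counts paths directly and omits edge weights. The function names and the six-node example are hypothetical, and the code is a simplified illustration rather than the implementation used in the study.

```python
from collections import defaultdict

def components(nodes, edges):
    """Connected components of an undirected graph, each as a sorted list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

def tree_edge_betweenness(nodes, edges):
    """In a forest, the betweenness of an edge is the number of node pairs
    separated by removing it: the product of the sizes of its two sides."""
    scores = {}
    for edge in edges:
        rest = [e for e in edges if e != edge]
        comps = components(nodes, rest)
        side_u = next(c for c in comps if edge[0] in c)
        side_v = next(c for c in comps if edge[1] in c)
        scores[edge] = len(side_u) * len(side_v)
    return scores

def girvan_newman_tree(nodes, edges, n_communities):
    """Remove the highest-betweenness edge until n_communities groups exist."""
    edges = list(edges)
    while edges and len(components(nodes, edges)) < n_communities:
        scores = tree_edge_betweenness(nodes, edges)
        edges.remove(max(scores, key=scores.get))
    return components(nodes, edges)

# Hypothetical MST: six students connected in a chain
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
communities = girvan_newman_tree(nodes, edges, n_communities=2)
```

In this chain the middle edge lies on the most paths, so it is removed first and the students split into two groups of three.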
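As an illustration of the comparison between two communities, the sketch below computes the Mann-Whitney U statistic and a two-sided p-value. It uses the normal approximation without tie or continuity corrections, which is a simplification that is only reasonable for moderately large groups; the function name and the grades are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation
    (simplified: no tie correction, no continuity correction)."""
    n1, n2 = len(x), len(y)
    # Count pairs (a, b) with a > b; a tie contributes one half
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value

# Hypothetical lab exam grades (percent) of two detected communities
group_a = [82, 75, 90, 68, 88, 79, 93, 71, 85, 77]
group_b = [35, 42, 28, 50, 39, 31, 45, 22, 38, 48]
u_stat, p_value = mann_whitney_u(group_a, group_b)
significant = p_value < 0.05
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred, since it handles ties and small samples more carefully.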

Selecting the number of detected communities
We observe that the Girvan-Newman algorithm can be seen as a hierarchical method, i.e. it constructs a dendrogram that shows the hierarchical clustering structure. The number of detected communities can, therefore, range from 1 to the number of nodes in the graph, in which case each community contains only one node. When using Girvan-Newman, it is necessary to define a criterion for the cut-off level in the dendrogram that creates the resulting communities.
In this research, we define the concept of mixed community rate. Let $C = (c_1, c_2, \ldots, c_n)$ be a community structure. Let $c_i = (h_i, l_i, n_i)$ be a detected community, where $h_i$ is the number of higher performing students, $l_i$ is the number of lower performing students and $n_i$ is the label of the community $c_i$. The label $n_i$ is identified as in Eq. (10) below:

$$ n_i = \begin{cases} \text{higher-performing}, & \text{if } h_i / (h_i + l_i) \ge k \\ \text{lower-performing}, & \text{if } l_i / (h_i + l_i) \ge k \\ \text{mixed}, & \text{otherwise} \end{cases} \qquad (10) $$

The parameter k can be configured, depending on analysis purposes. In the ideal case of k = 1, a community will only be labelled as higher or lower performing if it contains only higher or lower performing students. However, in practice we expect some similarity in learning behaviours between the two types of students, so it could be difficult to detect such a homogeneous community. Instead, we set k = 0.7, i.e. a community is labelled ''higher-performing'' if at least 70% of its students are higher performing, and similarly for ''lower-performing'' communities. Otherwise, the communities are labelled as ''mixed''. The mixed community rate of a community structure is then computed from these labels. The higher/lower performing communities may include key features about student success, while mixed communities may contain less information. As a result, we expect a good community structure to contain fewer mixed communities. Based on the mixed community rate indicator, it is possible to investigate each possible community structure in the resulting dendrogram from the Girvan-Newman algorithm and identify the number of detected communities by considering their mixed community rates. We also make a comparison between the original dataset and the cleaned dataset in terms of the community structures detected from them. If cleaned datasets can be used to produce community structures with lower mixed community rates, the cleaning method can show its effectiveness in community analysis.

Although we mainly focus on the Girvan-Newman algorithm in the scope of this paper, the Louvain algorithm [46], a commonly used community detection algorithm [49], is also used as a benchmark to verify whether the two algorithms produce significantly different results. In particular, we utilise the v-measure score [50], a widely used clustering metric, to measure the agreement between the two independent community assignments produced by the two algorithms for each dataset. Furthermore, we also investigate whether our cleaning method can support the Louvain method in generating better communities, with lower mixed community rates for the cleaned data in comparison with the original data.

Table 5. Summary of the features extracted from the four datasets at the end of the courses (after week 12).

Dataset          Number of materials   Number of transitions   Number of students
Course#1-2018    37                    819                     112
Course#1-2019    37                    867                     155
Course#2-2018    26                    409                     62
Course#2-2019    26                    496                     48
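The labelling rule with threshold k can be sketched as below. The labelling follows the description in the text, whereas the rate is assumed here to be the share of communities labelled mixed, which is one plausible reading and not necessarily the authors' exact definition. The function names and the example communities are hypothetical.

```python
def label_community(h, l, k=0.7):
    """Label a community with h higher performing and l lower performing students."""
    size = h + l
    if h / size >= k:
        return "higher"
    if l / size >= k:
        return "lower"
    return "mixed"

def mixed_community_rate(communities, k=0.7):
    """Assumed definition: fraction of communities that are labelled mixed."""
    labels = [label_community(h, l, k) for h, l in communities]
    return labels.count("mixed") / len(labels)

# Hypothetical community structure given as (h_i, l_i) pairs
structure = [(8, 2), (1, 9), (5, 5), (6, 4)]
labels = [label_community(h, l) for h, l in structure]
rate = mixed_community_rate(structure)
```

A cut-off level in the dendrogram could then be chosen by evaluating this rate for each candidate community structure and preferring the lowest value.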

Early prediction of learning outcome
To evaluate how well the students' interactions with course material items predict the learning outcomes, we use the student-event item data matrix as the input variable. In particular, we combine the student-event item data matrices of Course#1 in both academic years into a single tabular dataset. Then, we apply the cleaning method proposed in Section 3.5.2 to the dataset, forming fully cleaned data and partly cleaned data. For comparison purposes, we also use the original and PCA-transformed datasets as predictors. The target variable is defined based on the student's score in each lab exam. There are three lab exams for Course#1, in weeks 4, 8 and 12. We classify students who achieved more than 40% in the exam as higher-performing (passed) students and the remaining students as lower-performing.
For each week, a student-event data matrix has been extracted from the corresponding weekly event logs. The data collected in a given week contain all recorded learning events from the beginning of the course up to that week. The weekly data are then used to predict the student results for the next exam. For example, in Course#1, the data collected in weeks 1, 2, 3 and 4 are used to predict the results of lab exam 1; the data in weeks 5, 6, 7 and 8 are used to predict lab exam 2 results; and the remaining weekly data are used to forecast the last exam result.
In terms of prediction algorithms, Support Vector Machine (SVM) has been reported as the most effective technique for data captured from MOOCs in many contexts [51,52]. In addition to SVM, for reference and comparison purposes, we also select four additional classification techniques, namely XGBoost [53], Logistic Regression [54], Gradient Boosting [55] and K Nearest Neighbours [56], due to their wide application in the Learning Analytics domain [21]. In terms of development tools, we use the sklearn library [57] with Python as the main programming language.
Each dataset serves as input data for all algorithms, with the same parameter configuration for each technique. For each dataset, 80% of the data is used for training the models and the remaining 20% for validation. 10-fold cross-validation has also been applied, using ROC_AUC, Accuracy and F1 scores to evaluate the predictive performance of each model.
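The weekly set-up described above can be sketched as follows: events accumulate from week 1, each week is mapped to the next lab exam, and the target is a pass or fail label. The 4-week spacing and the 40% threshold come from the text; the function names, the item names and the event counts are hypothetical.

```python
import math

PASS_MARK = 40  # percent; per the text, more than 40% counts as a pass

def target_exam(week, weeks_per_exam=4):
    """Lab exam (1, 2 or 3) that data accumulated up to `week` is used to predict."""
    return math.ceil(week / weeks_per_exam)

def label(score):
    """Binary target derived from a lab exam score in percent."""
    return "pass" if score > PASS_MARK else "fail"

def cumulative_counts(weekly_events, week):
    """Sum event counts per item over weeks 1..week."""
    totals = {}
    for w in range(1, week + 1):
        for item, count in weekly_events.get(w, {}).items():
            totals[item] = totals.get(item, 0) + count
    return totals

# Hypothetical event counts of one student, keyed by week and item name
events = {1: {"Lecture_1": 3},
          2: {"Lecture_1": 1, "Practice_2": 4},
          5: {"Practice_5": 2}}
features_week5 = cumulative_counts(events, week=5)
exam_for_week5 = target_exam(5)
```

With this mapping, weeks 1 to 4 feed the prediction of lab exam 1, weeks 5 to 8 that of lab exam 2, and weeks 9 to 12 that of lab exam 3.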
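A minimal sketch of this evaluation protocol with scikit-learn is given below. The data are synthetic stand-ins, the classifier is one of the five listed, and the way the 80/20 split and the 10-fold cross-validation are combined (cross-validation on the training portion only) is an assumption, since the text does not specify it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for a student-event matrix and pass/fail labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 80% training, 20% validation, keeping the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 10-fold cross-validation on the training part with the three metrics
metrics = ["roc_auc", "accuracy", "f1"]
model = GradientBoostingClassifier(random_state=0)
cv_scores = cross_validate(model, X_train, y_train, cv=10, scoring=metrics)
mean_scores = {m: float(cv_scores[f"test_{m}"].mean()) for m in metrics}

# Final fit and a check on the held-out validation part
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
```

The same loop could be repeated for each classifier and each input dataset (original, PCA, fully cleaned, partly cleaned) to obtain comparable scores.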

Experimental results
In this section, we present the analysis results of the four datasets mentioned in Section 3.1. Learning behavioural features are constructed from the four datasets, as summarised in Table 5. Course#1 has more material items and transitions than Course#2. This is because Course#1 has been delivered to Software Engineering students and was more intensive than Course#2 which targets Business Computing students who may be ''less-technical''.

Selecting community structure
The extracted datasets summarised in Table 5 have been standardised, and this is followed by the calculation of cross-correlation matrices. The correlation matrices are cleaned before being used to calculate the learning behavioural distance matrices. The distance matrices have been used to construct MSTs. In other words, for each module, we construct a graph as an MST to display the similarity and dissimilarity of the students' learning behaviours. Based on the MSTs, we implement the Girvan-Newman algorithm for community detection. Students in each module can be divided into a smaller number of communities based on the distance between their learning behaviours and other learners' behaviours.
We note that the number of groups detected by the Girvan-Newman algorithm can be configured depending on analysis purposes, forming a community structure. In this research, we rely on the mixed community rate, i.e. a good community structure should have a lower mixed community rate. Figs. 5 and 6 show the mixed community rate for each possible community structure detected by the algorithm in Course#1 in both academic years. Indeed, the number of detected communities can go up to the total number of students in the whole graph. However, we do not want a fragmented community structure in which each community contains only a few students. Hence, we show only a part of the possible community structures in both figures. Both Figs. 5 and 6 show that the cleaned datasets provide better support for community detection than the original datasets. Overall, the mixed community rates of the community structures detected using the cleaned datasets are lower than those for the original datasets for Course#1. We also observed a similar phenomenon for the Course#2 datasets. In addition, based on these figures, it is possible to select the community structure at the lowest point of the mixed community rate line. The detected results can be seen in Table 6 for Course#1 and Table 7 for Course#2. In Table 6, eight groups have been detected, shown with the number of students in each group and the group's average grade in the final lab exam in week 12. Similarly, Table 7 displays nine detected groups for Course#2-2018 and eight groups for Course#2-2019. All groups are ordered from the highest to the lowest average grade in the tables.

Analysing highest vs lowest performing communities
The highest and lowest-performing communities in Tables 6 and 7 can be selected for further investigation of the differences in interactions with course material items among the student cohorts. Table 8 shows the difference in the number of learning activities on learning material items in Course#1 between the two groups, while the results for Course#2 can be seen in Table 9. For each item, a non-parametric statistical test (i.e. the Mann-Whitney U Test [48]) has been used to verify whether there is a significant difference between the highest and lowest performing communities in terms of using the item during the courses. The course material items for which the highest and the lowest performing communities differ significantly in the number of events (p-value < 0.05) are highlighted.

Table 8. Highest vs Lowest performing communities in Course#1. The asterisks indicate the learning items where there is a significant difference between the two communities (p-value < 0.05): * if there is a significant difference in Course#1-2018 only, ** if there is a significant difference in Course#1-2019 only, and *** if there are significant differences in both academic years.

Regarding learning events on practice-related items, students in the highest performing community appeared to be more active than those in the lowest performing community, with a higher average number of learning events in all Practice and Labsheet items across all four datasets. These gaps tend to widen over time. For example, in Course#1-2018, the average number of events in Practice_11 (i.e. practice items in week 11) of the highest performing community is about three times that of the lowest performing community. A similar phenomenon can be observed in the data of other cohorts. Nevertheless, students in the lowest performing community are recorded as creating a higher number of events on lecture notes for both classes of Course#1 and for Course#2-2018. For example, the number of events on lecture notes in weeks 2-7 created by the lowest performing community is about two to three times that of the highest performing community in Course#1.

Girvan Newman vs Louvain methods
We compare the Girvan-Newman community detection results selected above (Tables 6 and 7) with the corresponding results produced by the Louvain method on the four datasets. Table 10 illustrates an investigation of the possible difference between the two methods. It can also be seen that the v-measure scores for the cleaned data tend to be higher than those for the original data. This may imply that the proposed cleaning method reduces the variation between the two algorithms when they are applied to the same dataset. Additionally, the results in Table 10 indicate the values of the mixed community rates of the community structures detected by the Louvain method. The rates for the cleaned data tend to be lower than those for the original data. This is consistent with the application of the Girvan-Newman method, i.e. the cleaned data can also support the Louvain method in delivering better community detection results with a lower number of mixed communities.

Table 9. Highest vs Lowest performing communities in Course#2. Statistical tests were not conducted for this result because there are only a small number of students in the two communities (i.e. 5 and 6 students).

Fig. 7. Comparison of the roc_auc scores of predicting models using different data pre-processing strategies.
Fig. 8. Comparison of the accuracy scores of predicting models using different data pre-processing strategies.
Fig. 9. Comparison of the f1 scores of predicting models using different data pre-processing strategies.

Figs. 7, 8 and 9 show the ROC_AUC, Accuracy and F1 scores of the different models and input datasets, respectively. It can clearly be seen that the three figures illustrate a similar pattern in the differences in the evaluation metrics among the models. Overall, the fully and partly cleaned datasets appear to give better predictive performance than the other data preparation strategies. In particular, regarding the fully cleaned dataset, Gradient Boosting outperforms the other models in all three metrics (i.e. Accuracy: 0.79, F1 score: 0.785 and ROC_AUC: 0.81), followed by KNN, XGBoost and SVM. In terms of the partly cleaned dataset, the models have shown that they have [...] Conversely, models using the original and full PCA datasets appear to have lower performance across all predicting algorithms, with scores roughly between 0.60 and 0.70.
Meanwhile, although the PCA dataset with the largest principal components has the lowest performance with Gradient Boosting and KNN, it has shown its predictive value with the SVM algorithm, where it achieves the highest accuracy and roc_auc scores. It is possible that when only the top principal components were kept in the dataset, the noise part was eliminated.

Fig. 10 illustrates the mean cross-validated ROC_AUC score of the models using the fully cleaned dataset over the 12 weeks of the course, while Fig. 11 shows the results for the partly cleaned dataset. In general, most of the models can produce good predictions for the datasets after week 4. The ability of the data from the first four weeks to classify students is relatively poor, which is expected and is probably due to the imbalance between the numbers of passed and failed students in lab exam 1. In fact, lab exam 1 usually comprises the easiest tasks, which merely require an understanding of simple programming concepts, e.g. using variables, operators and inputs. As a result, the majority of students usually pass the first exam. However, the difficulty level increases over lab exams 2 and 3, causing the target variable to become more balanced. Hence, the models can better predict the pass or failure of a student in lab exams 2 and 3.

Early prediction investigation
Although the performance of the models increases over time as more data are collected, early data can support relatively good prediction. For example, the XGBoost model for week 5 data, which predicts the students' results in lab exam 2, achieves a ROC_AUC score of 0.78. The SVM model in week 9, which predicts the final exam in week 12, also achieves an acceptable result with a score of 0.80. These models appear to perform better than a recent prediction model in a similar computing educational context [23], where the author achieved a ROC_AUC score of 0.73 with an SVM model in the prediction of the student learning outcome in a Data Structure course. Therefore, early learning behaviour data may contain signs of students' learning outcomes [58] and can be good predictors, holding the potential to be a ''leading indicator'' of ''at-risk'' students.

Implications and limitations
From the above, we believe that it is possible to say that there is a relationship between the students' learning behaviours and their exam performances. We found that the students who are grouped in the same community were likely to achieve similar exam results. In other words, students having similar learning behaviours tend to perform similarly in the exam. This finding is in agreement with [59,60] where the authors have defined and analysed various learning styles with different learners' behaviours in perceiving and responding to learning environments. Moreover, the learning styles appeared to affect students' satisfaction and can also be a useful indicator of learning success [11].
Overall, the learning behaviours of students in Course#1 and Course#2 in both academic years tend to be similar. In both modules, we have found what seem to be differences between lower and higher performing communities. In particular, higher-performing students were found to be more active in practice-related items, such as navigating lab sheets and doing exercises. Besides, the higher performing students consistently interacted with course material items and exercises during the courses. The lower performing students, however, appeared to lose their focus and motivation to practice, i.e. to actually do programming tasks, in the later stages of the study. This result is consistent with the initial investigation of programming [61], which found that practice is essential for improving students' programming skills. These findings, available so early in the semester, are essential for such core courses, especially since it has been found that students should be given opportunities to practice and receive constructive feedback [19]. In [62], the authors indicated that programming skills may be improved if students practice frequently. However, in the context of this research, the students from the lower performing group might face challenges at the later stages of the course, e.g. the material becoming more difficult to understand. As a consequence, they might lose their confidence and motivation to actively participate in practical sessions, which further highlights the need for early intervention and encouragement. Besides, we noticed that the lower performing students tended to solve programming tasks only with the common methods rather than creatively trying different approaches. As a result, they mostly tend to upload a solution once and move on to other tasks. In contrast, the higher performing students tend to try various approaches for a given programming task and submit all of them, leading to a higher number of events on practical items logged in the system, in comparison with the practical activities of lower performing communities.
We also found that there seems to be a distinction between the learning behaviours of the Course#1 and Course#2 cohorts. In Course#1, the lower performing students appear to focus more on reading lecture notes than the higher performing students. However, this phenomenon is not observed in Course#2, where there is almost no difference in reading lecture notes between the two types of students. Indeed, the higher performing students in Course#2 seem to be more active in reading lecture notes in several weeks during the semester. In fact, the level of knowledge in Course#2 tended to be lower than in Course#1, with less advanced concepts and examples. We note that Course#1 was designed for Computer Science students and has a higher level of requirements for acquired knowledge and skills. Perhaps the lower performing students in Course#1 were struggling to acquire new advanced concepts, which kept them engaged with lecture notes instead of doing programming tasks.
In terms of learning outcome prediction, the use of log data collected from online learning systems to predict students' success is well developed in the literature. There have been many scientific reports on building early prediction systems in many application contexts, from flagging ''at-risk'' students [63] to recommending next courses [64] and learning strategies [65,66]. In our research, we provide a data pre-processing method that has been shown to be effective in improving the performance of widely used machine learning models in our context, i.e. programming education. This method can also be extended to the application contexts above, as long as the data satisfy the assumptions of Random Matrix Theory.
We recommend that instructors keep running community detection and prediction as students' results come in. Other performance indicators can also be used in addition to lab exam grades, such as weekly exercise results. In practice, community detection can be carried out at any point during the study. Once communities are detected, the instructors can implement prompt interventions. For example, the higher performing groups can be given harder exercises to keep them focused and prevent them from getting bored with the study. On the other hand, the lower performing groups should be given more basic tasks along with instructions or tutor sessions. Furthermore, the instructors can provide lower performing communities with additional supporting materials or easier tasks with solutions. This would fill the knowledge gap and build up the confidence and motivation of the students, as well as re-engage them in the study.
However, although the proposed method appears to be successful in reflecting the relationship between students' learning behaviours and learning performance, there are limitations due to the assumptions of Random Matrix Theory, which might prevent the method from being applicable to all kinds of learning behavioural data. The distribution of eigenvalues is given by Eq. (6) when the sample size (matrix rows) m → ∞ and the number of features (matrix columns) n → ∞, provided that the ratio of rows to columns is greater than or equal to 1. Hence, in the context of this paper, the number of transitions extracted from the event log data needs to be greater than or equal to the number of students. In addition, the application of RMT could be less effective for small datasets, i.e. those with a small number of students and course material items, although in that case community detection might not be that useful.
There is also a concern in terms of using the MST to reduce the size of the graph. When a distance matrix contains duplicate values, the associated graph will have edges of equal weight. Consequently, more than one MST may exist for the graph, and thus the results of the analysis may not be stable. In such cases, other graph size reduction techniques can be considered to obtain a single reduced graph, ensuring the stability of results in further analysis. For example, in [67], the authors proposed a network sparsification technique that sparsifies the network while preserving network structures and community properties. The comparison between such techniques is out of the scope of this paper and will be the target of future work in line with this research.
In the future, we will also focus on how the community structure of the students (currently represented as an MST) changes during the course. For example, a student may move to a different group in a different week, which may reveal that his or her learning behaviour has also changed. This analysis can help to understand thoroughly how students are studying and provide better support for educators to improve the curriculum. However, this requires more advanced research approaches to be developed to process more complex data. The time spent on course material items will also be considered, on top of the number of events, in future work. Additionally, while it has been found that the community analysis results do not vary significantly between the Girvan-Newman and Louvain methods, we believe the relationship between community detection techniques and analysis results is also worth further investigation. We will target this in future work, to further investigate the insights into learning behaviours among student communities.
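The row-to-column condition can be checked directly on the figures reported in Table 5, taking transitions as rows and students as columns as described above. The snippet is only a worked arithmetic check; the variable names are illustrative.

```python
# (number of transitions, number of students) per dataset, from Table 5
datasets = {
    "Course#1-2018": (819, 112),
    "Course#1-2019": (867, 155),
    "Course#2-2018": (409, 62),
    "Course#2-2019": (496, 48),
}

# Ratio of rows (transitions) to columns (students) for each dataset
ratios = {name: t / s for name, (t, s) in datasets.items()}
condition_holds = all(r >= 1 for r in ratios.values())
smallest = min(ratios, key=ratios.get)
```

All four datasets have several times more transitions than students, so the ratio requirement is met in each case.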
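Whether a given distance matrix is exposed to this issue can be checked before the MST is built, as in the short sketch below. The function name and the example matrices are hypothetical, and exact equality of floating-point values is used for simplicity; a tolerance may be more appropriate for real data.

```python
def has_unique_weights(dist):
    """True if all pairwise distances (upper triangle, i < j) are distinct,
    in which case the complete graph has exactly one MST."""
    n = len(dist)
    weights = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    return len(weights) == len(set(weights))

# Hypothetical distance matrices for three students
unique_case = [[0.0, 0.1, 0.4],
               [0.1, 0.0, 0.3],
               [0.4, 0.3, 0.0]]
tied_case = [[0.0, 0.2, 0.2],
             [0.2, 0.0, 0.2],
             [0.2, 0.2, 0.0]]
stable = has_unique_weights(unique_case)
unstable = not has_unique_weights(tied_case)
```

In the tied case any two of the three edges form a minimum spanning tree, which illustrates why the downstream community results could vary between runs.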

Conclusion
In this paper, we propose a novel approach to analyse students' learning behaviour data collected from an online learning system in the context of programming education. This research is one of the first attempts to utilise RMT and Community Detection in the educational domain. The analysis is based on a range of techniques. First, we extract a transition-student data matrix from the event log data. Second, we clean the effect of noise and trend in the correlation matrix of the transition-student data matrix, based on Random Matrix Theory. This cleaning process can help to reveal the underlying meaning of the data. The cleaned correlation matrix is used to construct a distance matrix and the Minimum Spanning Tree. The MST represents the relationships between students' learning behaviours in using course material items as a graph in which students with similar behaviours are closer to each other. The community detection algorithm, i.e. Girvan-Newman, has been applied to detect smaller student groups from the MST. Furthermore, the student-event data matrix is also cleaned and used as input to predict the learning outcome of students in the lab exams, using a range of machine learning classification techniques. The findings from the above method have been used to analyse the learning behaviours of students with different learning abilities in programming. The proposed approach to cleaning learning behavioural data also shows its effectiveness in community analysis and in building early prediction models. Insights from students' learning behaviours and recommendations are also discussed in the paper.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.