1 Introduction

Because online learning generates a large volume of records of students’ learning processes, it provides an effective way to gain a deep understanding of students’ learning behaviors and to predict their academic performance. Owing to these benefits, more and more universities combine traditional place-based courses with online education to achieve better teaching results. For this kind of course, it is feasible to predict students’ final performance early from their online learning records, so that timely intervention can be carried out for at-risk students.

In this paper, we study academic performance prediction for university students in a course that combines online learning with traditional place-based learning. Current research generally focuses on the learning behavior records collected from the corresponding learning management system, but ignores other factors that may be relevant to students’ academic performance. As indicated in [15], Internet access activities were found to be a major factor affecting students’ academic performance. Both users’ online behaviors, which can be clustered into several distinct patterns [14], and students’ static information [5] have an impact on academic performance prediction. In this study, two types of data are collected from 505 anonymous students to predict at-risk students in a university project-based course. One dataset records the students’ online learning activities on the course’s learning platform, which supports self-study. The other collects Internet access activity data from the campus network logs, from which the students’ behavioral patterns of accessing the Internet can be explored. Combining these two datasets provides deeper insight into students’ learning behaviors and their correlation with academic performance.

The remainder of this paper is organized as follows. Section 2 reviews related work on educational data mining (EDM) techniques for predicting student performance and on feature learning from time series. Section 3 describes the two experimental datasets in detail. Section 4 presents the proposed SPDN model. Experimental results are described in Sect. 5, and Sect. 6 concludes this work and discusses future avenues of research.

2 Related Work

2.1 Related Methods of Education Data Mining

A large body of work addresses student performance prediction. Current EDM methods generally fall into two categories. The first, traditional approach relies on machine learning methods for binary classification. In [1, 3, 4, 13], machine learning models use different types of predictive features extracted from raw online learning activity records to predict whether students can graduate on time. A generalized linear model has also been used to predict student dropout with features extracted from the original learning website log files, such as page click rates and forum activity [2, 10]. The second, emerging approach explores neural networks (NN). Because deep learning outperforms traditional machine learning in many respects, work has been done to predict student dropout in MOOCs through deep neural network (DNN) models [10] and recurrent neural network (RNN) models [6]. Unlike these methods, which still rely on feature engineering to reduce the input dimension and thereby limit the development of larger NN models, Kim et al. [11] propose GritNet, which takes the original learning behavior sequence from the network log as the raw input of an RNN model. It outperforms the standard logistic-regression-based method without complex feature engineering.

2.2 CNN for Behavioral Feature Learning of Time Series

Learning features from a time series means representing a sequence of behaviors within a time window as a low-dimensional vector. KimCNN [12] is a typical CNN structure that applies convolutions with kernels of several different sizes at every possible location of the activity vector matrix and uses max pooling to obtain the most prominent feature. In this way, it automatically extracts the features of the behavior sequence, and the model can easily be transferred to other datasets. KimCNN has been used in news recommendation to fuse semantic-level and knowledge-level representations of news [17]. Wang et al. proposed the knowledge-aware CNN (KCNN), which treats words and entities as multiple channels instead of simply concatenating them and explicitly keeps their alignment during convolution. It is therefore suitable for connecting words with their associated entities and convolving them together in a single vector space. In this paper, we adopt this structure in the MFCNN component to fuse students’ Internet access activities and learning activities and to learn a representation of student behavior.

3 Dataset Description and Insight

The analysis in this work is based on two datasets from 505 anonymous students. One is the website log of a university project-based course for freshmen, which records students’ online learning activities. The other is the campus network log, which reflects students’ Internet access activities.

In this section, we describe in detail the online learning activities in the university project-based course and the Internet access activities. We then investigate four distinct online behavior patterns by clustering.

3.1 Online Learning Activity

Students’ online learning activities are extracted from the online learning website log of a university project-based course that spans 13 weeks, from 2018.9.28 to 2018.12.27. The course aims to help freshmen get started in communication engineering, and its main characteristic is that it combines online education with traditional classes. The course is taught by teachers every Friday; students can read the course’s wiki and forum messages on the online learning website, and can also create their own wiki posts or participate in the forum. All online learning activities of students are recorded in the website log as an online learning activity sequence. Table 1 lists statistics of the actions in this dataset, which fall into two categories: viewing and writing. Each category involves six activities, and for each activity we further distinguish between different types of web pages.

Table 1. Statistics of learning behavior dataset of the Introductory course

In addition, there is a weekly quiz every Wednesday, which is scored by the teachers. The students are also grouped to complete a final innovation project, which is likewise scored by the teachers. The final performance of a student consists of two parts: the average score of the weekly quizzes and the final innovation project score. In this paper, we identify at-risk students based on these course results. Specifically, a student is considered at-risk if either the average quiz score or the innovation project score falls in the bottom 25% of the whole grade, because he or she is then lacking in theory or practice. In our dataset, there are 202 at-risk students in total.
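The labeling rule above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the toy scores and the way ties at the cutoff are handled are hypothetical and are not taken from the course data.

```python
import math

def bottom_cutoff(scores, fraction=0.25):
    """Score of the last student who still falls inside the bottom `fraction`."""
    ranked = sorted(scores)
    n_bottom = max(1, math.floor(len(ranked) * fraction))
    return ranked[n_bottom - 1]

def label_at_risk(quiz_avg, project, fraction=0.25):
    """Return {student: 1 if at-risk else 0}.

    A student is flagged when EITHER score lies in the bottom `fraction`.
    Assumption: students tied with the cutoff score are also flagged.
    """
    quiz_cut = bottom_cutoff(quiz_avg.values(), fraction)
    proj_cut = bottom_cutoff(project.values(), fraction)
    return {
        s: int(quiz_avg[s] <= quiz_cut or project[s] <= proj_cut)
        for s in quiz_avg
    }

# Hypothetical toy scores for eight students.
quiz_avg = {"s1": 55, "s2": 62, "s3": 70, "s4": 75,
            "s5": 80, "s6": 85, "s7": 90, "s8": 95}
project = {"s1": 88, "s2": 70, "s3": 60, "s4": 65,
           "s5": 90, "s6": 78, "s7": 82, "s8": 85}

labels = label_at_risk(quiz_avg, project)
print(labels)
```

Because the two criteria are combined with a logical "or", the share of at-risk students can exceed 25%, which is consistent with 202 of the 505 students being labeled at-risk.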

3.2 Internet Access Activity

The campus network records students’ Internet access activities in a log file that contains the categories of the visited URLs and the corresponding timestamps. There are 11 categories of Internet access activities, namely: ‘News’, ‘Game’, ‘Music’, ‘Download’, ‘File transfer’, ‘Search engine’, ‘Video’, ‘Shopping’, ‘Living tools’, ‘Instant messaging’ and ‘Non-instant messaging’. The log file holds a total of 22 million records for the 505 students during the semester of the project-based course.

To build a complete online activity sequence for each student, we merge the student’s online learning activities with the Internet access activities based on the student’s anonymous ID. We relabel all online learning activities as a new category of Internet access activity, ‘Learning’, and merge them with the original Internet access activities in chronological order to form the new Internet access activity sequence. To ensure that the online learning activity sequence and the Internet access activity sequence are aligned in the time dimension, we zero-pad the online learning activity sequence.
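A minimal sketch of this merging and zero-padding step is given below. The function name, the integer timestamps and the action names `view_wiki` and `post_forum` are hypothetical placeholders, not identifiers from the actual logs.

```python
def build_aligned_sequences(internet_log, learning_log):
    """Merge two per-student logs of (activity, timestamp) pairs.

    Returns two sequences of equal length:
      - the Internet access sequence, where every learning event appears
        under the single category 'Learning';
      - the online learning sequence, zero-padded at every position that
        is not a learning event.
    """
    # Each merged entry is (timestamp, internet_category, learning_action).
    merged = [(ts, act, 0) for act, ts in internet_log]
    merged += [(ts, "Learning", act) for act, ts in learning_log]
    merged.sort(key=lambda entry: entry[0])  # chronological order

    internet_seq = [(category, ts) for ts, category, _ in merged]
    learning_seq = [(action, ts) for ts, _, action in merged]
    return internet_seq, learning_seq

# Hypothetical toy logs for one student.
internet_log = [("Game", 10), ("Video", 30)]
learning_log = [("view_wiki", 20), ("post_forum", 40)]

internet_seq, learning_seq = build_aligned_sequences(internet_log, learning_log)
print(internet_seq)
print(learning_seq)
```

Both output sequences share the same timestamps position by position, so they can later be stacked as aligned channels.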

3.3 Distinct Behavior Patterns and Static Information

To investigate different online habits, we conduct a cluster analysis by feeding the normalized frequency counts of each action for all students into Ward’s hierarchical clustering algorithm [7]. The number of clusters is set to 4 based on the Calinski-Harabasz (CH) index [16] on the data. Table 2 shows the number of students and the at-risk rate of the four clusters. It shows that cluster 1 and cluster 2 have a low at-risk rate and account for a high proportion of the students, while nearly a quarter of the students in cluster 3 and cluster 4 are at-risk.

Table 2. Statistics of four clusters

Specifically, Fig. 1 illustrates the proportion of occurrences of each activity in the different cluster patterns. For each student (case), the proportion is the frequency of a particular activity divided by the count of all of that student’s activities. For each cluster, we then average these proportions over all students assigned to the cluster. The x-axis lists the different actions of the original Internet access activities and the online learning activities. There are obvious differences between clusters. The overall Internet access frequency of students in cluster 1 is very low, so they may prefer offline learning. Students in cluster 2 often use search engines, which may be related to learning. Conversely, students in cluster 3 prefer to watch videos and use living-tool applications, which may not be educational. On the learning website, they ask and answer questions in the forum relatively frequently but perform few viewing actions. Cluster 4 has the fewest students, who are extremely focused on online games and rarely engage in other types of online activities. On the learning website, their learning behavior is relatively inactive, which may also explain why the at-risk rate in this cluster is high.
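The per-cluster proportions described above could be computed along the following lines. This is an illustrative sketch only: the function names, the toy sequences and the cluster assignments are hypothetical.

```python
from collections import Counter, defaultdict

def activity_proportions(sequence, activities):
    """Share of each activity among all activities of one student."""
    counts = Counter(sequence)
    total = len(sequence)
    return [counts[a] / total if total else 0.0 for a in activities]

def cluster_profiles(student_sequences, student_cluster, activities):
    """Average the per-student proportions over the students of each cluster."""
    rows_by_cluster = defaultdict(list)
    for student, sequence in student_sequences.items():
        rows_by_cluster[student_cluster[student]].append(
            activity_proportions(sequence, activities)
        )
    return {
        cluster: [sum(column) / len(rows) for column in zip(*rows)]
        for cluster, rows in rows_by_cluster.items()
    }

# Hypothetical toy data: three students, two clusters.
activities = ["Game", "Search engine", "Learning"]
student_sequences = {
    "s1": ["Game", "Game", "Learning", "Game"],
    "s2": ["Game", "Search engine"],
    "s3": ["Search engine", "Learning", "Search engine", "Learning"],
}
student_cluster = {"s1": 4, "s2": 4, "s3": 2}

profiles = cluster_profiles(student_sequences, student_cluster, activities)
print(profiles)
```

Averaging proportions per student, rather than pooling raw counts, keeps very active students from dominating the profile of their cluster.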

Fig. 1. The four cluster interaction patterns

In addition, students carry out experiments and always learn together in groups, so students in the same group are likely to have the same academic status, and the grouping information has an important impact on prediction. For this reason, we take the student’s group id and cluster pattern as static information and incorporate them into the framework for predicting student performance.

4 Framework of Sequential Prediction Based on Deep Network

The overall framework of sequential prediction based on deep network (SPDN) is shown in Fig. 2 and can be roughly divided into four parts. In this section, we first introduce how the complete input sequence is constructed and embedded in the input representation component. We then discuss the details of the multi-source fusion CNN (MFCNN), which represents the student’s multiple activity sequences week by week. After that, we present how static information is joined with students’ behavioural representations and fed into the bi-LSTM model for prediction. We begin with a formulation of the problem.

Fig. 2. The architecture of SPDN

4.1 Formulation

As introduced in Sect. 3.2, we merge the campus network logging records and the course’s website log to build student \( u \)’s complete Internet access activity sequence, and complement the online learning activity sequence by zero padding. To formulate the problem more precisely, we first introduce the following definitions.

Definition 1. Internet Access Activity.

Let \( \mathbf{I} \) denote the set of Internet access activities. Student \( u \)’s complete Internet access activity sequence can be written as \( \widehat{I}(u) = i_{1:M} = [i_{1}, i_{2}, \ldots, i_{M}] \), where \( M \) is the length of the weekly Internet access activity sequence. Each element \( i_{t} \) is a tuple \( (a_{t}^{i}, d_{t}) \), where \( a_{t}^{i} \) is the Internet access activity, such as “Game”, “Music” or “Learning”, and \( d_{t} \) is the corresponding timestamp at time \( t \).

Definition 2. Online Learning Activity.

Let \( \mathbf{O} \) denote the set of online learning activities. Student \( u \)’s zero-padded online learning activity sequence can be written as \( \widehat{O}(u) = o_{1:N} = [o_{1}, o_{2}, \ldots, o_{N}] \) with \( N = M \), where \( N \) is the length of the weekly online learning activity sequence. Each element \( o_{t} \) is a tuple \( (a_{t}^{o}, d_{t}) \), in which \( a_{t}^{o} \) is the zero-padded online learning activity at time \( t \): if \( a_{t}^{i} \) is “Learning”, then \( a_{t}^{o} \in \mathbf{O} \); otherwise, \( a_{t}^{o} = 0 \).

Definition 3. Static Characteristics.

Static information comprises student \( u \)’s group id \( Z_{g} \) and cluster pattern \( Z_{p} \). These characteristics do not vary over time and can be concatenated and represented by a vector \( Z\left( u \right) \).

Definition 4. Time Difference.

Since directly employing each timestamp \( d_{t} \) would make the input space grow too quickly, we define the discretised time difference between adjacent events as:

$$ \Delta d_{t} \, = \,d_{t + 1} - d_{t} $$
(1)

In this way, each activity sequence will be accompanied by a time difference sequence as \( \widehat{T}\left( u \right)\, = \,\left[ {\Delta d_{1} ,\Delta d_{2} , \ldots ,\Delta d_{N} } \right] \left( {N\, = \,M} \right) \).
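As a rough illustration of Definition 4, the sketch below turns raw timestamps into hour-level time differences. Several details are assumptions made only for this example: timestamps are taken to be seconds, differences are capped at 167 so that they fit an index space of one week of hours, and the last event, which has no successor, is given a difference of 0 so that the sequence keeps the same length as the activity sequence.

```python
SECONDS_PER_HOUR = 3600

def hourly_time_differences(timestamps, max_bucket=167):
    """Discretised gaps d[t+1] - d[t], expressed in whole hours.

    Assumptions (not specified in the text): timestamps are in seconds,
    gaps longer than `max_bucket` hours are clipped, and the final event
    is padded with 0 because it has no following event.
    """
    diffs = []
    for current, following in zip(timestamps, timestamps[1:]):
        hours = (following - current) // SECONDS_PER_HOUR
        diffs.append(min(hours, max_bucket))
    diffs.append(0)  # padding for the last event
    return diffs

# Hypothetical timestamps (seconds since the start of the week).
timestamps = [0, 1800, 7200, 90000]
time_diffs = hourly_time_differences(timestamps)
print(time_diffs)
```

Discretising to hours keeps the number of distinct values small enough to be one-hot encoded and embedded like an activity category.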

Problem Formulation.

With these definitions, our task of predicting student performance can be expressed as a sequential event prediction problem: given student \( u \)’s Internet access activity sequence \( \widehat{I}(u) \) and online learning activity sequence \( \widehat{O}(u) \) in the first \( j \) (\( j \le 13 \)) weeks of the semester, as well as the static characteristics \( Z(u) \), our goal is to predict whether \( u \) will be at-risk in the course. More precisely, let \( y(u) \in \{0, 1\} \) denote the ground truth of whether \( u \) is at-risk, where \( y(u) \) is positive if and only if \( u \) is at-risk in the course. Our task is then to learn a function:

$$ f:\left( \widehat{I}(u), \widehat{O}(u), \widehat{T}(u), Z(u) \right) \to y(u) $$
(2)

4.2 Input Representation

To feed students’ activity sequences into SPDN, we transform each online learning activity \( a_{t}^{o} \), Internet access activity \( a_{t}^{i} \) and time difference \( \Delta d_{t} \) into one-hot encoded feature vectors \( \text{l}(a_{t}^{o}) \in \{0,1\}^{L_{o}} \), \( \text{l}(a_{t}^{i}) \in \{0,1\}^{L_{i}} \) and \( \text{l}(\Delta d_{t}) \in \{0,1\}^{L_{d}} \), where \( L_{o} \), \( L_{i} \) and \( L_{d} \) are, respectively, the number of unique online learning activity types, the number of unique Internet access activity types and the number of hours of the week. Student \( u \)’s encoded sequences are represented by \( D_{u}^{o} = [\text{l}(a_{1}^{o}), \text{l}(a_{2}^{o}), \ldots, \text{l}(a_{M}^{o})] \in R^{M \times L_{o}} \), \( D_{u}^{i} = [\text{l}(a_{1}^{i}), \text{l}(a_{2}^{i}), \ldots, \text{l}(a_{M}^{i})] \in R^{M \times L_{i}} \) and \( D_{u}^{d} = [\text{l}(\Delta d_{1}), \text{l}(\Delta d_{2}), \ldots, \text{l}(\Delta d_{M})] \in R^{M \times L_{d}} \).

Each one-hot vector is then converted into a dense vector through an embedding layer. That is, we learn three embedding matrices \( E_{o} \in R^{e \times L_{o}} \), \( E_{i} \in R^{e \times L_{i}} \) and \( E_{d} \in R^{e \times L_{d}} \), where \( e \) is the embedding dimension. The low-dimensional embedding vectors of the online learning activity, Internet access activity and time difference are defined as:

$$ \left\{ {\begin{array}{*{20}l} {v_{o} = E_{o} \cdot \text{l}\left( {a_{t}^{o} } \right)} \hfill \\ {v_{i} = E_{i} \cdot \text{l}\left( {a_{t}^{i} } \right)} \hfill \\ {v_{d} = E_{d} \cdot \text{l}\left( {\Delta d_{t} } \right)} \hfill \\ \end{array} } \right. $$
(3)

The embedded vectors all have the same dimension, and similar events lie closer together in the embedding space.
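The following small sketch illustrates Eq. (3) for the Internet access channel: multiplying an embedding matrix by a one-hot vector simply selects one column of the matrix. The three-category vocabulary and the matrix values are made up for illustration; in the model, the matrix entries are learned parameters.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def mat_vec(matrix, vector):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical vocabulary with L_i = 3 categories and embedding size e = 2.
vocabulary = ["Game", "Music", "Learning"]
E_i = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]  # shape e x L_i; values are illustrative, not learned

index = vocabulary.index("Learning")
encoded = one_hot(index, len(vocabulary))
v_i = mat_vec(E_i, encoded)          # v_i = E_i . l(a_t^i)
column = [row[index] for row in E_i]  # the same result by direct lookup

print(encoded, v_i, column)
```

This equivalence is why embedding layers are usually implemented as table lookups rather than as explicit matrix products.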

4.3 Multi-source Fusion CNN (MFCNN)

Following the process in Sect. 4.2, the next step is multi-source fusion. We employ the MFCNN component, which is multi-channel and keeps the multiple activities aligned, to compress the student’s three types of embedded activity sequences for each week. The three sequences can be regarded as different channels of the same action. We align and stack the three embedding matrices as \( V = [[v_{o1}\; v_{i1}\; v_{d1}]\,[v_{o2}\; v_{i2}\; v_{d2}] \ldots [v_{oM}\; v_{iM}\; v_{dM}]] \in R^{e \times M \times 3} \). Then, similar to KimCNN [12] introduced in Sect. 2.2, we use multiple convolution kernels \( h \in R^{e \times k \times 3} \) to extract particular local patterns in the action sequence, where \( k \) (\( k \le M \)) is the window size. The local activation of the submatrix \( V_{n:n + k - 1} \) with respect to the convolution kernel \( h \) is:

$$ c_{n}^{h} = f\left( h * V_{n:n + k - 1} + b \right), \quad 1 \le n \le M - k + 1, $$
(4)

where \( f \) is a nonlinear function, \( * \) is the convolution operator and \( b \) is the bias.

Then we use the max pooling operation on the feature map of the output as:

$$ \widetilde{c}^{h} = \max\left\{ c_{1}^{h}, c_{2}^{h}, \ldots, c_{M - k + 1}^{h} \right\} $$
(5)

All the features are concatenated to form the final representation \( a(u,j) \) of student \( u \)’s online behavior in the \( j \)-th week (\( 1 \le j \le 13 \)): \( a(u,j) = [\widetilde{c}^{h_{1}}\; \widetilde{c}^{h_{2}} \ldots \widetilde{c}^{h_{m}}] \), where \( m \) is the number of kernels. The weekly online behavioral representation is passed into the bi-LSTM together with the static information.
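To make Eqs. (4) and (5) concrete, the sketch below computes a weekly representation from a tiny stacked tensor in plain Python. It is an illustration under assumptions rather than the model’s implementation: ReLU is assumed for the nonlinearity \( f \), the tensor is indexed as `V[d][t][c]` (embedding dimension, time step, channel), and all numbers are invented.

```python
def relu(x):
    return max(0.0, x)

def conv_feature_map(V, h, b=0.0):
    """Slide kernel h (e x k x channels) over V (e x M x channels).

    Returns the M - k + 1 activations c_n = f(sum(h * window) + b).
    """
    e, M, k = len(V), len(V[0]), len(h[0])
    channels = len(V[0][0])
    feature_map = []
    for n in range(M - k + 1):
        total = b
        for d in range(e):
            for j in range(k):
                for c in range(channels):
                    total += h[d][j][c] * V[d][n + j][c]
        feature_map.append(relu(total))
    return feature_map

def weekly_representation(V, kernels, biases):
    """Max-pool each kernel's feature map and concatenate the results."""
    return [max(conv_feature_map(V, h, b)) for h, b in zip(kernels, biases)]

# Toy input: e = 2, M = 3 events, 3 channels (learning, Internet, time).
V = [
    [[1, 0, 2], [0, 1, 0], [3, 1, 1]],
    [[0, 2, 1], [1, 0, 0], [0, 0, 2]],
]
h1 = [[[1, 1, 1], [1, 1, 1]], [[1, 1, 1], [1, 1, 1]]]  # window size k = 2
h2 = [[[-1, 0, 0]], [[0, 0, 0]]]                      # window size k = 1

a_week = weekly_representation(V, [h1, h2], [0.0, 0.5])
print(a_week)
```

Each kernel contributes exactly one number after max pooling, so the length of the weekly representation equals the number of kernels regardless of how many events the week contains.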

4.4 Static Characteristics Component

This component provides a simple and effective strategy for incorporating the group id \( Z_{g} \) and the cluster pattern \( Z_{p} \) into SPDN. Since these characteristics are categorical, we encode them as one-hot vectors \( \text{l}(Z_{g}) \in \{0,1\}^{L_{g}} \) and \( \text{l}(Z_{p}) \in \{0,1\}^{L_{p}} \), where \( L_{g} \) is the number of student learning groups and \( L_{p} \) is the number of cluster pattern types. We then embed the group encoding vector \( \text{l}(Z_{g}) \) into a low-dimensional vector \( v_{Z_{g}} \in R^{e_{g}} \), where \( e_{g} \) is the dimension of the embedding vector. Student \( u \)’s static characteristics can be represented by \( \widehat{Z}(u) = [v_{Z_{g}} \oplus \text{l}(Z_{p})] \in R^{e_{g} + L_{p}} \). We then join the same static feature vector \( \widehat{Z}(u) \) with each of the student’s weekly behavioural representations \( a(u,j) \), as shown in Fig. 2. Let \( \widehat{X} = \widehat{X}_{u}^{(1)} \oplus \widehat{X}_{u}^{(2)} \oplus \ldots \oplus \widehat{X}_{u}^{(j)} \) represent the augmented feature vector, where each \( \widehat{X}_{u}^{(j)} \in R^{e_{g} + L_{p} + m} \) is a fused feature group that consists of student \( u \)’s weekly online behavioural representation \( a(u,j) \), whose length equals the number of kernels \( m \), and his or her static characteristics: \( \widehat{X}_{u}^{(j)} = [a(u,j) \oplus \widehat{Z}(u)] \).
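A minimal sketch of this fusion step follows. The sizes and values are invented for illustration, and the group embedding is given directly as a list, whereas in the model it would be a learned vector looked up from the group id.

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def fuse_static(weekly_representations, group_embedding, cluster_index, n_clusters):
    """Append the same static vector to every weekly behaviour vector."""
    static = group_embedding + one_hot(cluster_index, n_clusters)
    return [weekly + static for weekly in weekly_representations]

# Hypothetical values: two weeks, weekly vectors of length 2,
# a group embedding of size e_g = 3 and L_p = 4 cluster patterns.
weekly_representations = [[9.0, 0.5], [1.0, 2.0]]
group_embedding = [0.1, -0.2, 0.3]

fused = fuse_static(weekly_representations, group_embedding,
                    cluster_index=1, n_clusters=4)
for week_vector in fused:
    print(week_vector)
```

Each fused vector has length `len(weekly) + e_g + L_p` (here 2 + 3 + 4 = 9), and the static part is identical across weeks, so the recurrent layer sees it at every time step.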

4.5 Bi-LSTM and Prediction

The fused feature groups of each week are passed into a bi-LSTM [8], and the output vectors are formed by concatenating the outputs of the forward and backward directions. The purpose of the bi-LSTM is to make full use of context information and prevent gradient explosion. A max pooling layer is then added to learn the most relevant part of the event embedding sequence, and its output is fed into a fully connected layer and a Softmax layer in turn to estimate the student’s at-risk probability \( \widehat{y}(u) \in [0,1] \).

The parameters to be updated in the SPDN framework come mainly from four parts: (1) the embedding layers, (2) the CNN, (3) the bi-LSTM and (4) the fully connected layer. All the parameters can be learned by minimizing the following binary cross-entropy objective function:

$$ L(\theta) = - \sum\nolimits_{u \in U} \left[ y(u) \log\left( \widehat{y}(u) \right) + \left( 1 - y(u) \right) \log\left( 1 - \widehat{y}(u) \right) \right], $$
(6)

where \( \theta \) denotes the set of model parameters, \( \widehat{y}(u) \) is the predicted probability that student \( u \) is at risk, \( y(u) \) is the corresponding ground truth and \( U \) is the set of all students.
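As a worked illustration of Eq. (6), the sketch below evaluates the summed binary cross-entropy for given labels and predicted probabilities. The clipping constant `eps` is an assumption added here for numerical safety and is not part of the equation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Summed binary cross-entropy over all students, as in Eq. (6).

    `eps` clips probabilities away from 0 and 1 to avoid log(0);
    it is an implementation detail assumed for this sketch.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Two students with uninformative predictions of 0.5 each.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
print(loss)  # equals 2 * ln(2)

# Confident, correct predictions give a much smaller loss.
confident_loss = binary_cross_entropy([1, 0], [0.99, 0.01])
print(confident_loss)
```

Note that the objective sums over students rather than averaging, so its magnitude grows with the number of students in the set.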

5 Experiments

We conduct various experiments to evaluate the effectiveness of SPDN on the online action datasets of 505 anonymous students and adopt Adam to optimize the model.

5.1 Setting

In our experiments, we divide all behaviour sequences into 13 weeks and encode each week separately as an input series. The time difference between events is discretised in units of one hour. The embedding dimensions of the Internet access activity, online learning activity and time difference are all 100, the embedding dimension of the group id is 50, and the cluster patterns are one-hot encoded. In the MFCNN, we use 64 different kernels with a window size of 1. The bi-LSTM has forward and backward LSTM layers with 64 cell dimensions per direction. In addition, a batch normalization layer [9] is applied to the bi-LSTM output and to the fully connected layer output, which helps avoid vanishing-gradient problems and speeds up training. The mini-batch size is 64. We use 64% of the dataset as the training set, 16% as the validation set and 20% as the test set.

The hyperparameters above are the best configuration found by grid search over all experiments. Since the binary target label is imbalanced, the evaluation metrics include Accuracy, Area Under the ROC Curve (AUC) and F1 Score (F1).
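One way to obtain a 64% / 16% / 20% split of the 505 students is sketched below. The random shuffling, the seed and the rounding of subset sizes are assumptions of this example; the text does not state whether the split was random, stratified or group-aware.

```python
import random

def split_students(student_ids, train=0.64, val=0.16, seed=0):
    """Shuffle the ids and cut them into train / validation / test parts.

    The remaining students after the train and validation cuts form the
    test set. Sizes are rounded to the nearest integer (an assumption).
    """
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_students(range(505))
print(len(train_ids), len(val_ids), len(test_ids))
```

Since students in the same group tend to share an academic status, a split that keeps whole groups together may be worth considering as an alternative design.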
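For reference, the three metrics can be computed as in the sketch below. The decision threshold of 0.5 and the toy labels and scores are assumptions for illustration; AUC is computed with the pairwise-ranking definition, which does not depend on a threshold.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_positive(y_true, y_pred):
    """F1 of the positive (at-risk) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted at-risk probabilities.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred), f1_positive(y_true, y_pred), auc(y_true, scores))
```

In this toy case accuracy is 0.6 while the F1 of the positive class is only 0.5, which shows why accuracy alone can be misleading when the labels are imbalanced.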

5.2 Baseline Models

To assess how much added value SPDN brings, we compare it with several baseline models.

To compare with a general deep learning method, we take BLSTM_MA (Bidirectional Long Short-Term Memory with Multiple Activity) as a baseline model. We encode and embed the activities in the same way as in SPDN. However, to show the effect of MFCNN, we align and stack the Internet access activity, online learning activity and time difference embedding matrices as multiple channels and apply max pooling to obtain the features in each dimension, instead of extracting the features with a CNN. All other parameters are consistent with those of SPDN.

For the other baseline models, LR (logistic regression), NB (naive Bayes), DT (decision tree) and RF (random forest), we use the bag-of-words (BoW) model to represent each student’s past event sequence. After transforming all students’ activities into a BoW model, we count the occurrences of each unique activity in each weekly sequence as one part of the input. The group id and cluster pattern form the other part of the input. The purpose of these experiments is to demonstrate the effectiveness of deep learning.
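The bag-of-words input for the traditional baselines could be built roughly as follows. The vocabulary, the toy weekly sequences and the way the static features are appended (raw group id plus a one-hot cluster pattern) are assumptions of this sketch.

```python
from collections import Counter

def weekly_bow(weekly_sequences, vocabulary):
    """Concatenate per-week activity counts in a fixed vocabulary order."""
    features = []
    for week in weekly_sequences:
        counts = Counter(week)
        features.extend(counts[activity] for activity in vocabulary)
    return features

def baseline_input(weekly_sequences, vocabulary, group_id, cluster_index, n_clusters):
    """BoW counts followed by static features (encoding assumed here)."""
    cluster_one_hot = [1 if i == cluster_index else 0 for i in range(n_clusters)]
    return weekly_bow(weekly_sequences, vocabulary) + [group_id] + cluster_one_hot

# Hypothetical student with two weeks of activity.
vocabulary = ["Game", "Video", "Learning"]
weekly_sequences = [["Game", "Game", "Learning"], ["Video"]]

bow = weekly_bow(weekly_sequences, vocabulary)
x = baseline_input(weekly_sequences, vocabulary,
                   group_id=7, cluster_index=2, n_clusters=4)
print(bow)
print(x)
```

Unlike the sequence models, this representation discards the order and timing of events within a week, which is the information the deep models are intended to exploit.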

5.3 Prediction Performance

Table 3 presents the results on the test set for all compared methods. Overall, SPDN achieves the best performance on the dataset. Furthermore, BLSTM_MA and SPDN clearly outperform the traditional machine learning algorithms, which suggests that deep learning models can automatically extract more effective information from the activity sequence. Moreover, because the F1 score is the harmonic mean of precision and recall, it provides a more comprehensive evaluation of the model. In our problem, a higher F1 for the positive samples is expected. SPDN obtains a higher F1 score on positive samples than BLSTM_MA, which shows that extracting features through MFCNN is advantageous for predicting positive samples.

Table 3. Overall results

To identify the importance of different kinds of engagement activities in this task, we conduct feature ablation experiments on the three parts of the input, i.e. online learning activity, Internet access activity and static characteristics. Specifically, we first feed all three parts into SPDN and then remove each part in turn to observe the change in performance.

The results are shown in Table 4. All three inputs are useful in this task, especially the static information: when it is removed, the AUC drops steeply to 0.7419. Furthermore, Internet access activity plays a more important role than online learning activity, which is sparser and therefore less important.

Table 4. Contribution analysis for different engagement activities

5.4 Early Prediction

As shown in Fig. 3, as activity sequences accumulate, the AUC of SPDN and the baseline models gradually improves. The deep learning models, however, always have a higher AUC than the general machine learning models (Fig. 3 shows only one of the machine learning baselines, RF; the others follow a similar trend). Meanwhile, the deep baseline BLSTM_MA requires 11 weeks of student data to reach the same performance as SPDN, whereas SPDN achieves significant prediction-quality improvements within the first seven weeks of the semester. This illustrates that MFCNN can form a suitable weekly representation vector of user behaviour and extract features from a long behaviour sequence. SPDN can therefore be used for early prediction and can support early intervention by teachers.

Fig. 3. Comparisons of the SPDN and baseline models in terms of mean AUC for early prediction

6 Conclusion

In this paper, we propose a model named SPDN, which makes full use of online learning activities and Internet access activities and incorporates static information to predict student performance with a bi-LSTM. Experiments on the dataset of a university project-based course and the anonymous students’ network logging records show that SPDN achieves the best performance and, within the early weeks, reaches results close to its final value, so that at-risk students can be found in time. Meanwhile, Internet access activities have a greater impact on the prediction of students’ academic performance than online learning activities. In the future, we can incorporate information from more courses into the model to make it more scalable.