
1 Introduction

Optimal human learning techniques have been extensively studied by researchers in psychology [4] and computer science [8, 16, 19, 20]. The impact of a learning technique can be measured by how it affects long-term retention of the learning materials. Measuring retention requires a model of the human forgetting curve, which plots the probability of recall over time. The first version of the forgetting curve was defined by Ebbinghaus [5] but has since been developed further by many researchers, who have incorporated additional psychologically grounded variations into the model [3, 9, 13, 14, 17]. The ideal forgetting curve should adapt to the learning materials as well as to user meta-features (including current ability). In this study we examine the task of vocabulary learning. We investigate a range of linguistically motivated features, meta-features, and a variety of models in order to predict the probability that a given learner will correctly recall a particular word.

2 Method

We use the Duolingo spaced repetition dataset [15] to train and evaluate our features and models. The dataset is filtered for English language learners, which results in approximately 4.28 million learner-word datapoints. Our models are modifications of the half-life regression model proposed by Settles and Meeder [16].

2.1 Half-Life Regression (HLR)

The half-life regression model is defined as follows:

$$\begin{aligned} p = 2^{-\varDelta /h} \end{aligned}$$
(1)

where p is the probability of recall, \(\varDelta \) is the time since last seen (days) and h is the half-life or strength of the learner’s memory. We denote the estimated half-life by \(\hat{h}_{\varTheta }\), and it is defined as:

$$\begin{aligned} \hat{h}_{\varTheta } = 2^{\varTheta \cdot \mathbf {x}} \end{aligned}$$
(2)

where \(\varTheta \) is a vector of weights for the features \(\mathbf {x}\). The features of the model are made up of lexeme tags, one tag for each word in the vocabulary (e.g. the lexeme tag for the word camera is camera.N.SG). The aim of these features is to capture the inherent difficulty of each word.
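For illustration, the following Python sketch (not the authors' released implementation) computes Eqs. (1) and (2); the weight and feature values shown are hypothetical.

```python
import numpy as np

def estimated_half_life(theta, x):
    """Eq. (2): estimated half-life in days, h_hat = 2^(theta . x)."""
    return 2.0 ** np.dot(theta, x)

def recall_probability(delta, h):
    """Eq. (1): probability of recall after `delta` days given half-life `h`."""
    return 2.0 ** (-delta / h)

# Illustrative example with hypothetical values: a bias weight plus one
# lexeme-tag indicator (e.g. camera.N.SG).
theta = np.array([0.2, 1.1])   # learned weights (hypothetical)
x = np.array([1.0, 1.0])       # bias feature + lexeme-tag indicator
h_hat = estimated_half_life(theta, x)           # 2^1.3, roughly 2.46 days
p_hat = recall_probability(delta=3.0, h=h_hat)  # recall probability 3 days later
```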

The HLR model is trained using the following loss function:

$$\begin{aligned} \ell (\mathbf {x};\varTheta ) = (p - \hat{p}_{\varTheta })^2 + \alpha (h - \hat{h}_{\varTheta })^2 + \lambda ||\varTheta ||^{2}_{2} \end{aligned}$$
(3)

In practice, it was found that optimising for both p and h in the loss function improved the model; the hyperparameter \(\alpha \) weights the half-life term. The true value of h follows from Eq. (1) as \(h = \frac{-\varDelta }{\log _{2}(p)}\). p and \(\hat{p}_{\varTheta }\) are the true and model-estimated probability of recall, respectively.
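A minimal sketch of the per-instance loss in Eq. (3), assuming the observed recall rate is clipped away from 0 and 1 so that the true half-life stays finite (a detail not specified in the text); the default \(\alpha \) and \(\lambda \) values follow Sect. 2.5.

```python
import numpy as np

def hlr_loss(p, delta, x, theta, alpha=0.01, lam=0.1, eps=1e-4):
    """Per-instance HLR loss from Eq. (3).

    p is the observed recall rate, delta the lag in days, x the feature
    vector and theta the weights.  The clipping constant eps is an
    assumption to keep log2(p) finite; alpha and lam follow Sect. 2.5.
    """
    p = np.clip(p, eps, 1.0 - eps)
    h_true = -delta / np.log2(p)          # true half-life implied by Eq. (1)
    h_hat = 2.0 ** np.dot(theta, x)       # estimated half-life, Eq. (2)
    p_hat = 2.0 ** (-delta / h_hat)       # estimated recall probability, Eq. (1)
    return (p - p_hat) ** 2 + alpha * (h_true - h_hat) ** 2 + lam * np.dot(theta, theta)
```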

2.2 HLR with Linguistic/Psychological Features (HLR+)

We now extend the HLR model by adding linguistic, psychological and meta-features to \(\mathbf {x}\). We refer to this model as HLR+. The features include word complexity scores estimated by a pre-trained model [6], mean concreteness scores and percent-known scores based on human judgements [2], SUBTLEX word frequencies [18] and user IDs.

The motivation for including complexity as a feature is the intuition that the more complex a word, the harder it is to remember. Concreteness is included based on previous work showing that concrete words are easier to remember than abstract words because they activate perceptual memory codes in addition to verbal codes [10]. SUBTLEX gives the relative frequency of an English word based on a corpus of 201.3 million words: we hypothesise that more frequent words are more likely to be encountered and reinforced during the time since last seen \(\varDelta \). Similarly, we expect that ‘percent known’ (the proportion of respondents familiar with each word based on survey data) will correlate with the probability of recall. Lastly, we include the user ID to capture latent behavioural aspects of the learners.
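As a rough illustration, an HLR+ feature vector might be assembled as follows; the lookup dictionaries, the feature order and the use of a single scalar for the user ID are assumptions made for this sketch, not a description of the released feature pipeline.

```python
import numpy as np

def hlr_plus_features(word, user_id,
                      complexity, concreteness, percent_known, subtlex_freq,
                      lexeme_index, n_lexemes):
    """Build an illustrative HLR+ feature vector x for one (user, word) pair.

    complexity, concreteness, percent_known and subtlex_freq are assumed to be
    dictionaries mapping words to scores; lexeme_index maps lexeme tags to
    positions in a sparse one-hot block.  All names here are hypothetical.
    """
    lexeme_block = np.zeros(n_lexemes)
    lexeme_block[lexeme_index[word]] = 1.0   # original HLR lexeme-tag feature
    dense = np.array([
        1.0,                   # bias
        complexity[word],      # perceived word complexity [6]
        concreteness[word],    # mean concreteness rating [2]
        percent_known[word],   # proportion of respondents who know the word [2]
        subtlex_freq[word],    # SUBTLEX relative frequency [18]
        float(user_id),        # user ID as a single scalar feature
    ])
    return np.concatenate([dense, lexeme_block])
```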

2.3 Complexity-Based Half-Life Regression (C-HLR+)

In addition to adding new features, we now describe a new model that modifies the recall probability p so that it directly incorporates word complexity. Gooding et al. [6] derived word complexity scores to express perceived difficulty, and we hypothesise that these will correlate with the probability of recall: as the complexity of a word rises, its forgetting curve becomes steeper. The new model is therefore:

$$\begin{aligned} p = 2^{-\varDelta \cdot C_{i}/h} \end{aligned}$$
(4)

where \(C_{i}\) is the mean complexity of word i. We define the estimated half-life \(\hat{h}_{\varTheta }\) as \(2^{\varTheta \cdot \mathbf {x}}\), where \(\mathbf {x}\) is a vector composed of all of the features described in Sect. 2.2.
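A minimal sketch of Eq. (4), assuming the mean complexity \(C_{i}\) of each word is available as a scalar; the numerical values in the usage example are hypothetical.

```python
def c_hlr_recall_probability(delta, half_life, complexity):
    """Eq. (4): word complexity C_i scales the decay, steepening the curve
    for more complex words."""
    return 2.0 ** (-delta * complexity / half_life)

# With the same half-life, a more complex word is forgotten faster
# (hypothetical values).
p_hard = c_hlr_recall_probability(delta=3.0, half_life=2.5, complexity=1.5)
p_easy = c_hlr_recall_probability(delta=3.0, half_life=2.5, complexity=0.5)
```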

2.4 Neural Half-Life Regression (N-HLR+)

Motivated by the recent success of neural networks, we now describe the N-HLR+ model, which replaces \(\hat{h}_{\varTheta } = 2^{\varTheta \cdot \mathbf {x}}\) with a neural network. The network can be described as follows:

$$\begin{aligned} \hat{h}_{\varTheta } = \mathrm {ReLU}(\mathbf {x} \cdot \mathbf {w_{1}})\cdot \mathbf {w_{2}} \end{aligned}$$
(5)

where the network contains a single hidden layer. \(\mathbf {x}\) is a vector of input features, \(\mathbf {w_{1}}\) is the weight matrix between the inputs and the hidden layer and \(\mathbf {w_{2}}\) is the weight matrix between the hidden layer and the output. We use the same loss function as HLR which optimises for both p and h.
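A possible NumPy sketch of the single-hidden-layer network in Eq. (5); the hidden dimension of 4 follows Sect. 2.5, while the initialisation scheme is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_nhlr(n_features, hidden_dim=4):
    """Initialise w1 (input -> hidden) and w2 (hidden -> output) for Eq. (5).
    The hidden dimension of 4 follows Sect. 2.5; the initialisation scale is
    an assumption."""
    w1 = rng.normal(scale=0.1, size=(n_features, hidden_dim))
    w2 = rng.normal(scale=0.1, size=(hidden_dim,))
    return w1, w2

def nhlr_half_life(x, w1, w2):
    """Eq. (5): h_hat = ReLU(x . w1) . w2, a single-hidden-layer network."""
    hidden = np.maximum(0.0, x @ w1)   # ReLU activation on the hidden layer
    return hidden @ w2                 # scalar estimated half-life
```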

2.5 Evaluation and Implementation

We use the mean absolute error (MAE) of the predicted probability of recall for a lexical item as our evaluation metric, which, despite some known problems [11], is in line with previous work [16]. MAE is defined as \(\frac{1}{D}\sum _{i=1}^{D}\left| p - \hat{p}_{\varTheta } \right| _{i}\), where D is the total number of data instances.
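A minimal sketch of the MAE computation over arrays of true and predicted recall probabilities:

```python
import numpy as np

def mean_absolute_error(p_true, p_pred):
    """MAE over the test set: (1/D) * sum_i |p_i - p_hat_i|."""
    p_true = np.asarray(p_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean(np.abs(p_true - p_pred)))
```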

We divided the Duolingo English data into 90% training and 10% test sets. We trained all non-neural models (HLR, HLR+, C-HLR+) with the following parameters, which were tuned on the first 500k data points: learning rate 0.001, \(\alpha \) 0.01, \(\lambda \) 0.1. For the neural models (N-HLR+ and CN-HLR+), we used a learning rate of 0.001, 200 epochs and a hidden dimension of 4.

3 Results and Discussion

We can see in Table 1 that HLR+ did not perform much better than HLR. Incorporating word complexity directly into the forgetting curve in the C-HLR+ model considerably improved performance. This is in line with our hypothesis that more complex words are forgotten faster, making complexity an important feature in modelling the forgetting curve.

The N-HLR+ model provided additional improvements over the C-HLR+ model, likely because neural models are better at capturing non-linearities between the features and the expected output. Furthermore, compared to the N-HLR+ model, including complexity in the loss function (CN-HLR+) provides no clear improvement in performance. This is because the N-HLR+ model already learns to place more importance on the complexity feature. We confirm this by analysing the average weights in the hidden layer of the model: it learns to give the greatest importance to word complexity, followed by percent known and concreteness. It does not, however, learn much from the user ID and SUBTLEX features. This is probably because a single dimension is not sufficient to capture user behaviour, and because SUBTLEX does not adequately represent learners' experience of English as a second language.

Table 1. Evaluation of forgetting curve models. Pimsleur and Leitner are previous methods of modelling the forgetting curve.

4 Conclusion

We present a new model for adaptively learning a forgetting curve for language learning, using a modified HLR loss function and a neural network. We incorporate linguistically and psychologically motivated features and show that word complexity is an important feature in predicting the probability of recall for a vocabulary item. Furthermore, we illustrate that neural networks can capture the importance of word complexity, while a simple HLR model fails to take advantage of that signal. This work lays the foundation for neural approaches to understanding language learning over time. Future work includes incorporating high-dimensional user embeddings to capture user-specific signals that might influence the forgetting curve, as well as exploring different models, such as Pareto and power functions, which have been proposed in prior work [1].