
1 Introduction

Optimal human learning techniques have been extensively studied by researchers in psychology [4] and computer science [8, 16, 19, 20]. The impact of a learning technique can be measured by how it affects long-term retention of the learning materials. Measuring retention requires a model of the human forgetting curve, which plots the probability of recall over time. The first version of the forgetting curve was defined by Ebbinghaus [5] but has since been developed further by many researchers, who have incorporated additional psychologically grounded variations into the model [3, 9, 13, 14, 17]. The ideal forgetting curve should adapt to the learning materials as well as to user meta-features (including current ability). In this study we examine the task of vocabulary learning. We investigate a range of linguistically motivated features, meta-features, and a variety of models in order to predict the probability that a given learner will correctly recall a particular word.

2 Method

We use the Duolingo spaced repetition dataset [15] to train and evaluate our features and models. The dataset is filtered for English language learners, which results in approximately 4.28 million learner-word datapoints. Our models are modifications of the half-life regression model proposed by Settles and Meeder [16].

2.1 Half-Life Regression (HLR)

The half-life regression model is defined as follows:

$$\begin{aligned} p = 2^{-\varDelta /h} \end{aligned}$$
(1)

where p is the probability of recall, \(\varDelta \) is the time since last seen (days) and h is the half-life or strength of the learner’s memory. We denote the estimated half-life by \(\hat{h}_{\varTheta }\), and it is defined as:

$$\begin{aligned} \hat{h}_{\varTheta } = 2^{\varTheta \cdot \mathbf {x}} \end{aligned}$$
(2)

where \(\varTheta \) is a vector of weights for the features \(\mathbf {x}\). The features of the model are made up of lexeme tags, one tag for each word in the vocabulary (e.g. the lexeme tag for the word camera is camera.N.SG). The aim of these features is to capture the inherent difficulty of each word.
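For illustration, the following Python sketch (not the authors' released implementation) computes Eqs. (1) and (2); the weight and feature values shown are hypothetical.

```python
import numpy as np

def estimated_half_life(theta, x):
    """Eq. (2): estimated half-life in days, h_hat = 2^(theta . x)."""
    return 2.0 ** np.dot(theta, x)

def recall_probability(delta, h):
    """Eq. (1): probability of recall after `delta` days given half-life `h`."""
    return 2.0 ** (-delta / h)

# Illustrative example with hypothetical values: a bias weight plus one
# lexeme-tag indicator (e.g. camera.N.SG).
theta = np.array([0.2, 1.1])   # learned weights (hypothetical)
x = np.array([1.0, 1.0])       # bias feature + lexeme-tag indicator
h_hat = estimated_half_life(theta, x)           # 2^1.3, roughly 2.46 days
p_hat = recall_probability(delta=3.0, h=h_hat)  # recall probability 3 days later
```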

The HLR model is trained using the following loss function:

$$\begin{aligned} \ell (\mathbf {x};\varTheta ) = (p - \hat{p}_{\varTheta })^2 + \alpha (h - \hat{h}_{\varTheta })^2 + \lambda ||\varTheta ||^{2}_{2} \end{aligned}$$
(3)

In practice, it was found that optimising for both p and h in the loss function improved the model; the hyperparameter \(\alpha \) weights the half-life term. The true value of h follows from Eq. (1) as \(h = \frac{-\varDelta }{\log _{2}(p)}\). p and \(\hat{p}_{\varTheta }\) are the true and model-estimated probability of recall, respectively.
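A minimal sketch of the per-instance loss in Eq. (3), assuming the observed recall rate is clipped away from 0 and 1 so that the true half-life stays finite (a detail not specified in the text); the default \(\alpha \) and \(\lambda \) values follow Sect. 2.5.

```python
import numpy as np

def hlr_loss(p, delta, x, theta, alpha=0.01, lam=0.1, eps=1e-4):
    """Per-instance HLR loss from Eq. (3).

    p is the observed recall rate, delta the lag in days, x the feature
    vector and theta the weights.  The clipping constant eps is an
    assumption to keep log2(p) finite; alpha and lam follow Sect. 2.5.
    """
    p = np.clip(p, eps, 1.0 - eps)
    h_true = -delta / np.log2(p)          # true half-life implied by Eq. (1)
    h_hat = 2.0 ** np.dot(theta, x)       # estimated half-life, Eq. (2)
    p_hat = 2.0 ** (-delta / h_hat)       # estimated recall probability, Eq. (1)
    return (p - p_hat) ** 2 + alpha * (h_true - h_hat) ** 2 + lam * np.dot(theta, theta)
```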

2.2 HLR with Linguistic/Psychological Features (HLR+)

We now extend the HLR model by adding linguistic, psychological and meta-features to \(\mathbf {x}\). We refer to this model as HLR+. The features include word complexity scores estimated by a pre-trained model [6], mean concreteness scores and percent-known scores based on human judgements [2], SUBTLEX word frequencies [18] and user IDs.

The motivation for including complexity as a feature is the intuition that the more complex a word, the harder it is to remember. Concreteness is included based on previous work showing that concrete words are easier to remember than abstract words because they activate perceptual memory codes in addition to verbal codes [10]. SUBTLEX gives the relative frequency of an English word based on a corpus of 201.3 million words: we hypothesise that more frequent words are more likely to be encountered and reinforced during the time since last seen \(\varDelta \). Similarly, we expect that ‘percent known’ (the proportion of respondents familiar with each word based on survey data) will correlate with the probability of recall. Lastly, we include the user ID to capture latent behavioural aspects of the learners.
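As a rough illustration, an HLR+ feature vector might be assembled as follows; the lookup dictionaries, the feature order and the use of a single scalar for the user ID are assumptions made for this sketch, not a description of the released feature pipeline.

```python
import numpy as np

def hlr_plus_features(word, user_id,
                      complexity, concreteness, percent_known, subtlex_freq,
                      lexeme_index, n_lexemes):
    """Build an illustrative HLR+ feature vector x for one (user, word) pair.

    complexity, concreteness, percent_known and subtlex_freq are assumed to be
    dictionaries mapping words to scores; lexeme_index maps lexeme tags to
    positions in a sparse one-hot block.  All names here are hypothetical.
    """
    lexeme_block = np.zeros(n_lexemes)
    lexeme_block[lexeme_index[word]] = 1.0   # original HLR lexeme-tag feature
    dense = np.array([
        1.0,                   # bias
        complexity[word],      # perceived word complexity [6]
        concreteness[word],    # mean concreteness rating [2]
        percent_known[word],   # proportion of respondents who know the word [2]
        subtlex_freq[word],    # SUBTLEX relative frequency [18]
        float(user_id),        # user ID as a single scalar feature
    ])
    return np.concatenate([dense, lexeme_block])
```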

2.3 Complexity-Based Half-Life Regression (C-HLR+)

In addition to adding new features, we now describe a new model that modifies the recall probability p so that it directly incorporates word complexity. Gooding et al. [6] derived word complexity scores to express perceived difficulty, and we hypothesise that these will correlate with the probability of recall: as the complexity of a word rises, its forgetting curve becomes steeper. The new model is therefore:

$$\begin{aligned} p = 2^{-\varDelta \cdot C_{i}/h} \end{aligned}$$
(4)

where \(C_{i}\) is the mean complexity of word i. We define the estimated half-life \(\hat{h}_{\varTheta }\) as \(2^{\varTheta \cdot \mathbf {x}}\), where \(\mathbf {x}\) is a vector composed of all of the features described in Sect. 2.2.
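A minimal sketch of Eq. (4), assuming the mean complexity \(C_{i}\) of each word is available as a scalar; the numerical values in the usage example are hypothetical.

```python
def c_hlr_recall_probability(delta, half_life, complexity):
    """Eq. (4): word complexity C_i scales the decay, steepening the curve
    for more complex words."""
    return 2.0 ** (-delta * complexity / half_life)

# With the same half-life, a more complex word is forgotten faster
# (hypothetical values).
p_hard = c_hlr_recall_probability(delta=3.0, half_life=2.5, complexity=1.5)
p_easy = c_hlr_recall_probability(delta=3.0, half_life=2.5, complexity=0.5)
```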

2.4 Neural Half-Life Regression (N-HLR+)

Motivated by the recent success of neural networks, we now describe the N-HLR+ model, which replaces \(\hat{h}_{\varTheta } = 2^{\varTheta \cdot \mathbf {x}}\) with a neural network. The network can be described as follows:

$$\begin{aligned} \hat{h}_{\varTheta } = \mathrm {ReLU}(\mathbf {x} \cdot \mathbf {w_{1}})\cdot \mathbf {w_{2}} \end{aligned}$$
(5)

where the network contains a single hidden layer. \(\mathbf {x}\) is a vector of input features, \(\mathbf {w_{1}}\) is the weight matrix between the inputs and the hidden layer and \(\mathbf {w_{2}}\) is the weight matrix between the hidden layer and the output. We use the same loss function as HLR which optimises for both p and h.
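A possible NumPy sketch of the single-hidden-layer network in Eq. (5); the hidden dimension of 4 follows Sect. 2.5, while the initialisation scheme is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_nhlr(n_features, hidden_dim=4):
    """Initialise w1 (input -> hidden) and w2 (hidden -> output) for Eq. (5).
    The hidden dimension of 4 follows Sect. 2.5; the initialisation scale is
    an assumption."""
    w1 = rng.normal(scale=0.1, size=(n_features, hidden_dim))
    w2 = rng.normal(scale=0.1, size=(hidden_dim,))
    return w1, w2

def nhlr_half_life(x, w1, w2):
    """Eq. (5): h_hat = ReLU(x . w1) . w2, a single-hidden-layer network."""
    hidden = np.maximum(0.0, x @ w1)   # ReLU activation on the hidden layer
    return hidden @ w2                 # scalar estimated half-life
```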

2.5 Evaluation and Implementation

We use the mean absolute error (MAE) of the predicted probability of recall for a lexical item as our evaluation metric, which, despite some known problems [11], is in line with previous work [16]. MAE is defined as \(\frac{1}{D}\sum _{i=1}^{D}\left| p - \hat{p}_{\varTheta } \right| _{i}\), where D is the total number of data instances.
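A minimal sketch of the MAE computation over arrays of true and predicted recall probabilities:

```python
import numpy as np

def mean_absolute_error(p_true, p_pred):
    """MAE over the test set: (1/D) * sum_i |p_i - p_hat_i|."""
    p_true = np.asarray(p_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean(np.abs(p_true - p_pred)))
```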

We divided the Duolingo English data into 90% training and 10% test sets. We trained all non-neural models (HLR, HLR+, C-HLR+) with the following parameters, which were tuned on the first 500k data points: learning rate 0.001, \(\alpha \) 0.01, \(\lambda \) 0.1. For the neural models (N-HLR+ and CN-HLR+), we used a learning rate of 0.001, 200 epochs and a hidden dimension of 4.

3 Results and Discussion

We can see in Table 1 that HLR+ did not perform much better than HLR. Incorporating word complexity directly into the forgetting curve in the C-HLR+ model considerably improved performance. This is in line with our hypothesis that more complex words are forgotten faster, making complexity an important feature in modelling the forgetting curve.

The N-HLR+ model provided additional improvements over the C-HLR+ model, likely because neural models are better at capturing non-linearities between the features and the expected output. Furthermore, compared to the N-HLR+ model, including complexity in the loss function (CN-HLR+) provides no clear improvement in performance. This is because the N-HLR+ model already learns to place more importance on the complexity feature. We confirm this by analysing the average weights in the hidden layer of the model: it learns to give the greatest importance to word complexity, followed by percent known and concreteness. It does not, however, learn much from the user ID and SUBTLEX features. This is probably because a single dimension is not sufficient to capture user behaviour, and because SUBTLEX does not adequately represent learners' experience of English as a second language.

Table 1. Evaluation of forgetting curve models. Pimsleur and Leitner are previous methods of modelling the forgetting curve.

4 Conclusion

We present a new model for adaptively learning a forgetting curve for language learning, using a modified HLR loss function and a neural network. We incorporate linguistically and psychologically motivated features and show that word complexity is an important feature in predicting the probability of recall for a vocabulary item. Furthermore, we illustrate that neural networks can capture the importance of word complexity, while a simple HLR model fails to take advantage of that signal. This work lays the foundation for neural approaches to understanding language learning over time. Future work includes incorporating high-dimensional user embeddings to capture user-specific signals that might influence the forgetting curve, as well as exploring different models, such as Pareto and power functions, which have been proposed in prior work [1].