
A Survey on Recent Approaches to Question Difficulty Estimation from Text

Published: 16 January 2023


Abstract

Question Difficulty Estimation from Text (QDET) is the application of Natural Language Processing techniques to the estimation of a value, either numerical or categorical, which represents the difficulty of questions in educational settings. We give an introduction to the field, build a taxonomy based on question characteristics, and present the various approaches that have been proposed in recent years, outlining opportunities for further research. This survey provides an introduction to the domain of question difficulty estimation from text for researchers and practitioners, and acts as a point of reference on recent research in this topic.


1 INTRODUCTION

Question Difficulty Estimation (QDE)—also referred to as “question calibration”—consists of estimating a value, either numerical or categorical, that represents the difficulty of a question, and is of crucial importance in the educational domain. The importance of an accurate estimation of question difficulty is best understood through some use cases. An example is Computerized Adaptive Testing [62], an examination format in which students are provided with questions whose difficulty is targeted to their proficiency, which was shown to be highly beneficial to the learning outcome [18]. With miscalibrated questions (i.e., questions whose difficulty has been erroneously estimated), students are provided with questions inappropriate to their level, which affects the learning outcome [108]. Question difficulty is also leveraged to assess students accurately: In some testing frameworks, students’ skill levels are estimated based on their past answers to exam questions and the known difficulty of those questions. A student who correctly answered a very difficult question will have an estimated knowledge level higher than that of a student who correctly answered an easier question; therefore, miscalibrated items may affect the accuracy of students’ assessment. Last, regardless of the testing theory used in designing the exams, a test that is too easy or too difficult for a particular group results in a limited range of scores, which is not informative [3].

Traditionally, QDE is performed with either (i) manual calibration [1] or (ii) pretesting [58]. Manual calibration consists of having one (or more) domain experts manually select a numerical or categorical value representing the difficulty of each question, which is not scalable, intrinsically subjective, and often inconsistent. The other approach, pretesting, consists of deploying the new questions in an exam, as if they were standard questions, but without using them to assess students. The other questions in the exam are used to assess the students, and their answers—together with the estimated skill levels—are used to calibrate the questions under pretesting. Although this approach does lead to an accurate and reliable estimation of question difficulty, it introduces a long delay between the time of question generation and the time when the questions can be used to assess students. Also, it requires the new questions to be shown to students before they are actually used to score them, which is undesirable, since they might be leaked or exposed too often [110].

To overcome the limitations of traditional approaches to question calibration, recent research has attempted to leverage the textual content of questions with Natural Language Processing (NLP) techniques to automatically estimate their difficulty. Indeed, question text is the only information that is always available at the time of question creation and, if we could estimate question difficulty from it, we would remove the need for pretesting and manual calibration, and thus their limitations. No surveys have been carried out about Question Difficulty Estimation from Text (QDET), a research direction that has received increased attention in recent years, thanks to concurrent advancements in NLP. Two existing surveys focused on question generation have referred to some works that addressed the task of QDET [15, 57]. However, such survey papers did not consider papers that performed QDET without focusing on question generation; therefore, a comprehensive review of the recent literature on QDET is still lacking.

With this survey article, our goal is to present a comprehensive review of recent (i.e., from 2015) approaches to QDET. Even though research on techniques to perform QDET and to modify question difficulty in a controllable manner has a fairly long history [12, 60, 79], we focus only on recent works, since development has been much more rapid than in previous years. Indeed, the past few years have seen an improvement in the capabilities of NLP techniques, and this has been reflected in the progress on QDET. Overall, we find that there has been a shift from the usage of theoretically supported features such as readability and word-complexity measures towards approaches that rely upon modern NLP techniques based on machine learning.

In this survey, we aim to create a single point of reference for any researcher or practitioner working on the task of QDET or approaching it for the first time. We propose a taxonomy based on question format to organize all the approaches published so far and analyze the techniques that have proven effective (or ineffective) in certain scenarios. We do not perform a quantitative comparison of the different approaches, as that is not feasible for several reasons. First, different approaches are generally designed to work in different scenarios—different educational domains, different types of questions, different question formats, different definitions of difficulty, and so on. Second, due to the value of exam material, there is a lack of publicly available resources, which makes it difficult to exactly reproduce all the proposed approaches. Indeed, most of the papers presented in this survey only compare themselves with simple baselines (random or majority), rather than with previously proposed models.

The contributions of this work can be summarized in the following points: (i) we perform a comprehensive review of recent work on QDET, (ii) we propose a taxonomy to organize such works, (iii) we discuss future research directions and limitations of the proposed approaches. We envision two ways of reading this survey: (i) a read from start to finish, which informs on all the approaches that were proposed in previous research for different types of questions and compares them; and (ii) a read focused on specific types of questions, which can be guided by the proposed taxonomy.

This document is organized as follows: Section 2 describes the research method. Section 3 introduces the testing theories featuring in this survey. Section 4 presents the proposed taxonomy. Sections 5 and 6 dive into the details and describe the approaches proposed in the literature. Section 7 presents some additional analyses, along different dimensions, of the papers presented in this survey. Section 8 concludes the article.


2 RESEARCH METHOD

To retrieve the works included in this survey, we proceeded as follows:

  • We performed a comprehensive search on digital libraries (AAAI, ACL, ACM, Elsevier, IEEE, Springer) using relevant keywords; to not miss relevant papers, we also performed the same search on Google Scholar, since it offers wider coverage of works on the Internet, but we only retained peer-reviewed research publications.

  • We manually filtered all the papers, keeping the ones that satisfy the following criteria: (i) they propose and evaluate approaches that leverage textual information to perform QDE, either as the final target or as an intermediate step; (ii) they focus on the educational domain; (iii) they have been peer reviewed; (iv) they were published in 2015 or later; (v) they are written in English.

  • We collected all the papers that cited or were cited by the remaining papers; we made use of the citation network, because we observed that the initial keywords—although chosen to include as many relevant works as possible—still failed to retrieve some papers that are relevant to this survey.

  • We filtered the resulting papers using the same criteria as before.

Several recent works proposed models to create question embeddings from text, claiming that they are capable of capturing several question characteristics, including difficulty. The final target of these papers is not QDET and no experiments are performed to support the claim that such embeddings capture question difficulty; therefore, they are not included in this survey (e.g., References [49, 50, 65, 75, 95, 103]). Similarly, we consider papers from the recent literature on difficulty-controllable question generation only if they can be used for QDET of already existing questions. For instance, we do not analyze Reference [33], which proposed an approach for generating questions of a given difficulty but cannot be used to calibrate existing questions. Last, we do not include the papers that perform QDET in domains other than education, such as community question answering systems [64], the ones that target the estimation of the reading complexity of a piece of text (e.g., Reference [20]), and the ones that consider question difficulty for question answering models instead of human learners [34].


3 THEORETICAL BACKGROUND: THEORIES OF TESTING

All the papers presented in this survey perform QDET, but the definition of difficulty can be diverse. Indeed, three approaches are used: (i) Classical Test Theory, (ii) Item Response Theory, and (iii) manual definitions. Regardless of the theory (if any) used to obtain it, question difficulty can be either a continuous value or a discrete value, therefore the task of QDET can be seen either as a regression task or as a classification (discrete regression) task. It is important to remark here that the decision of which theory to use is an exam design choice outside the scope of this survey. However, below, we summarize the theoretical testing frameworks that are featured in this survey.

3.1 Classical Test Theory (CTT)

CTT [38] is a well-established testing theory that predicts outcomes of psychological testing, such as the difficulty of items or the ability of test-takers. The term “classical” refers to the contrast with modern psychometric theories such as IRT, compared to which CTT has the advantage of being simple to compute and to understand.

CTT assumes that each individual is associated with a true score T, which would be the expected correctness over an infinitely long run of repeated independent administrations of the same test. In practice, the observed score X is used, which is the sum of the true score T and an error E: \( X = T + E \), where T and E are two unobservable (or latent) variables. The major assumptions of CTT are that (i) T and E are not correlated, (ii) E is normally distributed with zero mean, and (iii) the errors of different tests are not correlated. Item difficulty in CTT is expressed by the p-value, which is a continuous value in the range \( [0; 1] \). The p stands for “probability”: It is the fraction of correct responses in the considered population. The \( \text{p-value} \) is typically referred to as correctness: The higher the \( \text{p-value} \), the easier the item. Similarly, we can define the wrongness as \( 1-\text{p-value} \): The higher the value, the more difficult the item.
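As an illustration, the following minimal sketch (with a made-up response matrix) shows how the CTT p-value and the corresponding wrongness would be computed; the data and variable names are purely hypothetical.

```python
import numpy as np

# Hypothetical response matrix: rows are students, columns are items,
# 1 = correct answer, 0 = wrong answer.
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])

# CTT item difficulty: the p-value is the fraction of correct responses
# in the considered population (the higher the p-value, the easier the item).
p_values = responses.mean(axis=0)
wrongness = 1 - p_values

print("p-values: ", p_values)    # item 4 is answered correctly by everyone -> p-value = 1.0
print("wrongness:", wrongness)
```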

The main limitation of CTT is that it does not leverage the students’ skills when estimating the item difficulty: In practice, it simply uses the fraction of students that wrongly answer a question without considering their skill level.

3.2 Item Response Theory (IRT)

IRT [39] is another well-established technique that associates latent traits to both students and questions. Its simplest implementation, the one-parameter model (named the “Rasch Model” [82]), associates a skill level \( \theta \) to each student and a difficulty level b to each question. An important property of IRT is “invariance”: Item latent traits do not depend on the ability distribution of test takers and a given question is assigned the same difficulty regardless of the skill levels of the students answering it (in contrast to CTT, which simply considers the fraction of correct and wrong answers). Two important assumptions of IRT are that (i) individuals are independent from each other and that (ii) the item responses of a given individual are independent from each other.

For a given question j and its latent trait \( b_j \), we can define the item response function (i.r.f.), which indicates the probability (\( \text{P}_{\text{C}} \)) that a student i with skill level \( \theta _i \) answers the question correctly: \( \text{P}_{\text{C}} = \frac{1}{1 + e^{-1.7 \cdot (\theta _i - b_j)}} \), where the coefficient 1.7 was empirically found in previous research to generally lead to accurate results. The underlying intuition is that a student with a given skill \( \theta _i \) has a lower probability of correctly answering more difficult questions: If a question is far too difficult or far too easy (i.e., \( b_j \rightarrow \infty \) or \( b_j \rightarrow -\infty \)), then all the students answer in the same way (i.e., \( \text{P}_{\text{C}} \rightarrow 0 \) or \( \text{P}_{\text{C}} \rightarrow 1 \)), which shows why it is important to have assessment items that are neither too easy nor too difficult.
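To make the behavior of the item response function concrete, the following sketch implements the formula above and evaluates it for a few illustrative skill-difficulty pairs (the specific values are made up).

```python
import math

def irf(theta: float, b: float, scale: float = 1.7) -> float:
    """One-parameter item response function: probability that a student with
    skill theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-scale * (theta - b)))

# The probability of a correct answer decreases as the item difficulty b
# grows relative to the student's skill theta.
for theta, b in [(0.0, -2.0), (0.0, 0.0), (0.0, 2.0)]:
    print(f"theta={theta:+.1f}, b={b:+.1f} -> P_C={irf(theta, b):.3f}")
# Prints approximately 0.968, 0.500, and 0.032.
```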

Question difficulties obtained in IRT are real values in a given range (selected at the time of calibration) but, in practice, are sometimes converted to discrete values, thus representing difficulty in a discrete manner.

3.3 Manual Definition

In some cases, question difficulty is not based upon any learning theories and it is just manually selected by educational experts. In all these cases—at least considering the papers presented in this survey—difficulty is a discrete value, and the number of possible classes can vary, depending on the specific implementation.


4 TAXONOMY

Figure 1 presents the taxonomy we propose for categorizing all the papers presented in this work. We group the papers depending on the characteristics of the questions that the proposed models work on, since the type of question heavily affects the models that can be used in each application scenario. We provide here a brief overview of the proposed approaches and their categorization and will describe them in detail in Sections 5 and 6.

Fig. 1. The taxonomy based on question format we use for categorizing the papers presented in this survey.

The first distinction is the educational domain considered by each work; there is a crucial difference between (i) Language Assessment (LA), both first and second language, and (ii) Content Knowledge Assessment (CKA), e.g., math. Indeed, question difficulty comes from different sources in these two scenarios: In LA, the difficulty comes from the linguistic demands of the task and the topic being assessed, along with any stimulus text, while in CKA the difficulty mostly comes from the topics that are being assessed and the question format is less important. Moreover, CKA questions are often built to minimize the effects of language on the difficulty [114]. This difference has an influence on the approaches to QDET in the two domains, as they focus on different features: Approaches developed for LA often rely upon theoretically supported measures such as readability formulas and predefined word complexity measures, which are rarely used in CKA; CKA works, instead, often leverage learned features, such as TF-IDF (Term Frequency–Inverse Document Frequency) [52] and word embeddings (e.g., word2vec [69], ELMo [80]), or end-to-end neural networks, which are much less common in LA. This difference also shows that—generally—research in LA does not focus on semantic word representations, which are instead very important for CKA.

Almost all the proposed approaches to perform QDET, in both domains, address the task as a supervised problem: A training set containing texts and difficulties of exam questions is used to train a model that is capable of performing QDET for previously unseen questions. In some cases, additional textual datasets are used to pre-train the model or part thereof. In such cases, the models built for LA leverage general purpose datasets (e.g., Wikipedia), while the ones built for CKA leverage datasets related to the topics that are assessed by the questions (e.g., books, lecture transcripts).

4.1 Language Assessment (LA)

Focusing on LA, most approaches to QDET deal either with (i) comprehension questions or (ii) knowledge questions. Comprehension questions are provided to the student together with a passage (either written or spoken) that contains the answer to the question, meaning that answering a comprehension question involves finding the answer in the passage (or inferring it from the passage). In contrast, knowledge questions assess the knowledge of the student at a certain time, and the answer to the question is not found in (or inferred from) a related passage.

4.1.1 Comprehension Questions.

These questions can be categorized into reading comprehension and listening comprehension. Only one of the works presented here focuses on listening comprehension questions [67], while reading comprehension questions received slightly more consideration in previous research, as we found four relevant works that focused on them [10, 47, 48, 61]. We also notice that the papers on comprehension questions are very recent (the first one is dated 2017), most likely because recent advancements in NLP techniques based on neural networks have enabled new ways of leveraging the accompanying texts.

4.1.2 Knowledge Questions.

This type of question has received more attention than comprehension questions, and this interest also started before the first research on comprehension questions. This also has an impact on the types of models that are used for QDET of knowledge questions for LA. Indeed, many of the models use fairly simple and theoretically grounded features, such as word complexity for learners of specific languages and readability measures. No end-to-end neural networks have been proposed so far, and most of the works did not experiment with word embeddings or word frequency features. Knowledge questions for LA can be further divided depending on their format: Some are vocabulary questions made of single words [23, 26, 88, 116], while others consist of whole sentences [5, 6, 31, 44, 47, 59, 74, 88, 97, 98, 99, 104].

4.1.3 Others.

There are two types of questions that are explored in one paper only [88] and do not fall into any of the previous categories: (i) elicited speech and (ii) dictation exercises. The elicited speech task evaluates the reading and speaking skills of students by requiring them to produce a sentence out loud, while the dictation task consists of asking the students to transcribe an audio recording and thus evaluates both listening and writing skills.

4.2 Content Knowledge Assessment (CKA)

In CKA, all items are knowledge questions, and can be categorized depending on the content of the questions. Specifically, they can be divided into (i) text only questions and (ii) heterogeneous questions, which contain information—such as images—that cannot be captured at text level. Equations and formulas are generally considered as “text,” since they can be expressed in LaTeX-like verbal format [118]. Questions with images are quite rare and this is reflected by the fact that only three works [30, 94, 119] experimented on QDET for heterogeneous questions. The research focused on text only questions can be categorized depending on the type of information that is leveraged by the models. Specifically, we can distinguish between (i) models that only consider the question text for the task of QDET [8, 9, 27], (ii) models that also leverage texts from other sources (e.g., lecture content, books) [45, 81, 114, 115, 122], and (iii) models that leverage non-textual information (e.g., ontologies [29, 55, 89, 107], knowledge components [21, 102], and others [94, 113]).

Last, there are two works that do not belong to any of the previous categories because they deal with specific types of questions and can be used only in the niches they were designed for. One of them [78] deals with questions whose answers are in the form of First Order Logic formulas and leverages such formulas for QDET. The other [72] performs QDET for short-answer questions and leverages the text of the students’ answers (not of the question).


5 LANGUAGE ASSESSMENT

In this section, we present and discuss the approaches that have been proposed for QDET in the language assessment domain: We focus on reading comprehension questions in Section 5.1, on listening comprehension questions in Section 5.2, on word knowledge questions in Section 5.3, on sentence knowledge questions in Section 5.4, and on elicited speech and dictation items in Section 5.5.

5.1 Reading Comprehension Questions

In reading comprehension questions, students are given a textual passage and one (or more) questions associated with it, as shown in the example in Figure 2.

Fig. 2. Example of reading comprehension question from Reference [47].

The reading passage is an important component—although not the only one—of question difficulty, and this is reflected in the four models proposed in recent years. Indeed, one of them [47] completely bases the estimation of question difficulty on the reading complexity of the reading material, while the others [10, 48, 61] leverage both the text of the question and the text of the accompanying passage. An overview of the four models is shown in Table 1.

Paper | Year | Sources of text | Approach
[47] | 2018 | Reading passage only | Reading difficulty directly used as an indication of question difficulty.
[10] | 2021 | Reading passage, question text | Five features computed from the text of the question and the passage are normalized, averaged, and then compared to a threshold.
[61] | 2019 | Reading passage, question text | Words are embedded with word2vec, the sequences of word embeddings are embedded with LSTMs, the final estimation is done with an FCNN.
[48] | 2017 | Reading passage, question text, and distractors | Words are embedded with word2vec, the sequences of word embeddings are embedded with a sentence CNN, an attention mechanism is used to detect the relevant parts of the passage, and the final estimation is done with an FCNN.

Table 1. Overview of the Approaches Proposed for Estimating the Difficulty of Reading Comprehension Questions

More specifically, in Reference [47] (2018), the authors assume that examinees correctly answer a reading comprehension question only if they can understand the whole textual passage; therefore, they directly use reading complexity as an indicator of question difficulty. For the estimation of reading complexity, the authors adopt a measure designed for learners of English as a foreign language [46]. This can be considered a fairly simple approach, and it leaves plenty of room for improvement, since it estimates the same difficulty for all questions associated with a given passage. By analyzing the correctness of students’ answers across a set of questions of varying difficulty, the authors find a relation between question difficulty and average correctness.

In Reference [10] (2021), the authors propose a complexity-controllable question generation model, which has a complexity estimator that can be used on already existing questions as well. The proposed approach is fairly simple, as it computes five features and compares their values with a trained threshold. To be precise, it computes (i) number of clauses in the question, (ii) number of dependency relations in the question, (iii) topic coherence of sentences in the passage, (iv) frequency of question entities in the passage, and (v) distance between entities in the question and in the passage. Then, it computes their average (after normalization) and compares the result with the threshold: If it is larger than the threshold, then the question is labelled as difficult, otherwise it is labelled as easy.
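A minimal sketch of this thresholding scheme is shown below; the feature values and ranges are hypothetical, and min-max normalization is an assumption, since the normalization step is not fully specified here.

```python
import numpy as np

def estimate_difficulty(features: dict, feature_ranges: dict, threshold: float) -> str:
    """Normalize each feature (min-max scaling is assumed), average the
    normalized values, and compare the result with a trained threshold."""
    normalized = []
    for name, value in features.items():
        lo, hi = feature_ranges[name]
        normalized.append((value - lo) / (hi - lo) if hi > lo else 0.0)
    return "difficult" if float(np.mean(normalized)) > threshold else "easy"

# Hypothetical values of the five features for one question-passage pair.
features = {
    "n_clauses": 3,
    "n_dependency_relations": 12,
    "topic_coherence": 0.4,
    "entity_frequency_in_passage": 5,
    "entity_distance": 14,
}
feature_ranges = {name: (0, 20) for name in features}  # illustrative min/max values
print(estimate_difficulty(features, feature_ranges, threshold=0.3))
```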

In Reference [61] (2019), the authors propose a neural model to estimate the difficulty of Chinese reading comprehension items. First, each word is transformed into a semantic vector of 300 dimensions with word2vec (trained on the Sinica Balanced Corpus [19]), and there is no distinction between the words of the document and the words of the question. The embedding vectors are input into two uni-directional Long Short-Term Memory networks (LSTMs) [42]. Then, the output derived from the two LSTMs is input into a Fully Connected Neural Network (FCNN) made of three layers that outputs a value in the range \( [0; 1] \) representing the difficulty of the item. The experimental dataset is made of 334 items, and the model reaches an accuracy of 37%, with a random baseline of 20%.
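The description above leaves some architectural details open (in particular, how the two uni-directional LSTMs are combined); the following PyTorch sketch shows one plausible reading, with the two LSTMs simply stacked for illustration. The dimensions follow the description (300-dimensional word2vec inputs, a three-layer fully connected head, an output in \( [0; 1] \)), but the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class DifficultyEstimator(nn.Module):
    """Sketch of a word2vec -> LSTM -> FCNN difficulty regressor.
    The exact wiring of the two uni-directional LSTMs in Reference [61] is not
    fully specified above, so they are simply stacked here for illustration."""

    def __init__(self, emb_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),  # difficulty in [0, 1]
        )

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, sequence_length, emb_dim) pretrained word2vec embeddings
        _, (h_n, _) = self.lstm(word_vectors)
        return self.head(h_n[-1])  # final hidden state of the last LSTM layer

model = DifficultyEstimator()
dummy_item = torch.randn(1, 50, 300)  # one item represented by 50 word vectors
print(model(dummy_item).item())
```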

The model proposed in Reference [48] (2017) is the only one that explicitly takes into consideration the relation between the reading passage and the question. It does so by using an attention mechanism [106] to model the importance of each sentence in the reading document for a specific question. The proposed model is made of four components: (i) input component, (ii) sentence CNN (Convolutional Neural Network) component, (iii) attention component, and (iv) prediction component. All the questions are Multiple Choice Questions (MCQ), and the model leverages both the text of the question (i.e., the stem) and the text of the options.

In the input component, all the text material of a question (i.e., document, stem, and options) is converted into pretrained embeddings using word2vec (with 200 dimensions) trained on the English Gigaword dataset [36]. The sentence CNN component reduces the dimensionality of the input data by applying a series of convolution and max-pooling operations. The attention component aims at finding which parts of the text are relevant for each question. In practice, there are two attentions involved in the model, both computed using cosine similarity; the first one measures the similarity between the text stimulus and the question, the second one measures the similarity between the question and the available answers. Last, the prediction component concatenates the two outputs of the attention components and uses an FCNN to learn the difficulty.
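The attention weights in this model are computed with cosine similarity between text representations; a minimal numpy sketch of that idea follows, with random vectors standing in for the CNN sentence embeddings and with softmax normalization of the weights as an assumption.

```python
import numpy as np

def cosine_attention(passage_sents: np.ndarray, question: np.ndarray) -> np.ndarray:
    """Weight each passage sentence by its cosine similarity to the question and
    return the attention-weighted passage representation."""
    sims = passage_sents @ question / (
        np.linalg.norm(passage_sents, axis=1) * np.linalg.norm(question) + 1e-9
    )
    weights = np.exp(sims) / np.exp(sims).sum()  # softmax over the sentences
    return weights @ passage_sents

rng = np.random.default_rng(0)
passage_sents = rng.normal(size=(6, 200))  # 6 sentence embeddings (e.g., from the sentence CNN)
question = rng.normal(size=200)            # question embedding
attended_passage = cosine_attention(passage_sents, question)
print(attended_passage.shape)  # (200,)
```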

The proposed model outperforms the baselines proposed by the authors, which are simplified versions of this same model, but no random or majority baseline is evaluated. Even though this model is arguably more advanced than the others, it is the oldest one from a temporal point of view; therefore, the authors do not compare it with the other models explicitly built for QDET in reading comprehension questions. The attention mechanism is a particularly interesting feature of this model, since it is an important step towards explainability: The authors do perform some analysis of the text spans attended to, but without diving into details, which would certainly be worth doing in future research.

5.2 Listening Comprehension Questions

Only one paper [67] about QDET for listening comprehension questions was published in recent years. Specifically, the authors focus on MCQs from an English language proficiency test. Item difficulty depends on both the audio transcript and the text of the question, and indeed the proposed approach leverages both sources of information. First, the authors compute 339 features from the text (written and spoken) using TextEvaluator, an automated text complexity prediction system [92]. Then they experiment with several regressors for estimating item difficulty from the features. The features can be categorized into the following groups: academic vocabulary, concreteness, word familiarity, syntactic complexity, cohesion, argumentation, conversational style, and narrative structure.

Using the Pearson’s correlation coefficient between the true difficulty and the estimated difficulty, the authors show that a random forest regressor consistently outperforms all the other models. All the groups of features seem to bring valuable information for QDE, and the most highly ranked features were related to the lexical content of the item text, for all item types. These highly ranked features covered three aspects of vocabulary: (i) vocabulary diversity, measured as the type-to-token ratio in the item texts, (ii) the difficulty of vocabulary in the item text, as measured by the frequency of the words in different corpora, and (iii) the concreteness and imageability of the text [91].
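Two of these lexical aspects can be illustrated with a few lines of Python; this is only a rough sketch, as TextEvaluator computes far more refined versions of these measures, and the toy text and counts below are invented.

```python
import math
from collections import Counter

def type_token_ratio(tokens: list[str]) -> float:
    """Vocabulary diversity: number of distinct word types over total tokens."""
    return len(set(tokens)) / len(tokens)

def mean_log_frequency(tokens: list[str], corpus_counts: Counter) -> float:
    """Word difficulty proxy: average smoothed log-frequency of the words in a corpus."""
    total = sum(corpus_counts.values())
    return sum(math.log((corpus_counts[t] + 1) / (total + 1)) for t in tokens) / len(tokens)

item_text = "the speaker describes the schedule of the morning lecture".split()
toy_corpus_counts = Counter({"the": 1000, "of": 500, "schedule": 8, "lecture": 5})
print(type_token_ratio(item_text))                       # lower values = more repetition
print(mean_log_frequency(item_text, toy_corpus_counts))  # lower values = rarer vocabulary
```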

5.3 Single Word Knowledge Questions

Single word knowledge questions are all vocabulary questions, aiming to assess the vocabulary breadth of students. They can have different formats, as shown in Figure 3.

Fig. 3. Examples of vocabulary questions.

In Yes/No tests (Figure 3(a)) students receive a list of words, and have to select the ones that are real words. In Vocabulary Knowledge Scale (VKS, Figure 3(c)), students are asked to report how well they know a word and—if they report knowing it—they have to provide a synonym, a translation, or an example of the word in context [51]. Last, in Vocabulary Level Test (VLT) students are shown one or more definitions together with one or more target words, and they have to match the definitions with the target words [4, 71, 85]. The number of definitions and target words can vary, as shown in Figures 3(b) and 3(d).

Unlike comprehension questions, no information is available in addition to the target word and (possibly) some definitions; therefore, the proposed models are generally simpler from an architectural point of view, as shown in the overview in Table 2. Interestingly, none of the proposed approaches leveraged the definitions for QDET.

Paper | Year | Approach | Question format
[26] | 2018 | SVM that uses as features word2vec embeddings. | VLT
[116] | 2018 | SVM that uses as features: word length, word frequency, utilization on the web, Age-of-acquisition, concreteness rating, number of POS tags, most frequent POS tag, word2vec embeddings, number of double consonants, number of vowels, presence of shorter homophones. | Yes/No, VKS, VLT
[88] | 2020 | Weighted softmax model that uses as features: word length, log-likelihood from character-level language model, Fisher score. | Yes/No

Table 2. Overview of the Approaches Proposed for Estimating the Difficulty of Single Word Knowledge Questions

Neural networks are rarely used and, when they are, there is generally little focus on semantics, as the difficulty is assumed to come mainly from other aspects of the words. Indeed, Reference [26] (2018) is the only work that leverages word2vec embeddings, without any other features, for QDET for this type of question. Specifically, the author focuses on VLT, with one word and four definitions. The approach is made of a word2vec model—pretrained on all English Wikipedia texts—for embedding the target words and a Support Vector Machine (SVM) regression model with linear kernel for the numerical estimation. The author experiments with questions whose difficulty (obtained with IRT) is in the range \( [-5; +5] \), and the proposed model only performs QDET with an RMSE of 3.632. No baselines are evaluated, but the results show that there is clear room for improvement. There might be several reasons for this: One of them is most likely the small dataset that was used for the experiments (92 words, 22 being held out for testing), but it may also be that word2vec is simply not well suited to this task.
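A sketch of this pipeline is shown below, assuming pretrained word vectors are already available as a lookup table (the original work used word2vec trained on English Wikipedia); the words, vectors, and difficulty values are invented.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical lookup table of pretrained 300-dimensional word embeddings.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=300)
                for w in ["cat", "run", "abandon", "meticulous", "ubiquitous"]}

train_words = ["cat", "run", "abandon", "meticulous"]
train_difficulties = [-4.0, -3.5, -1.2, 3.1]  # IRT difficulties in [-5, +5] (made up)

X_train = np.stack([word_vectors[w] for w in train_words])
model = SVR(kernel="linear")
model.fit(X_train, train_difficulties)

# Estimate the difficulty of an unseen target word.
print(model.predict(word_vectors["ubiquitous"].reshape(1, -1)))
```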

The first work that evaluated the correlation between the difficulty of vocabulary questions (Yes/No, VLT, and VKS) and some textual features is Reference [23] (2015), which found that character length and corpus frequency significantly correlate with vocabulary difficulty. However, this work did not have the task of performing difficulty prediction, and it is therefore mostly used as a starting point by more recent research.

An example is Reference [116] (2018), in which the authors propose an approach that can be used for VKS, VLT with one word, and Yes/No items (although only for real words). The approach consists of (i) computation of features related to the word difficulty level, (ii) reduction of these features with Principal Component Analysis (PCA), and (iii) classification with an SVM. The model uses the following features: word length, word frequency (obtained from NLTK corpora [66]), utilization on the web (i.e., number of relevant documents retrieved by Google), Age-of-acquisition from Reference [56], concreteness rating from Reference [14], number of part-of-speech (POS) tags (obtained from NLTK corpora), most frequent POS tag, word2vec embeddings (100 dimensions), number of double consonants in the word, number of vowels, and existence of shorter homophones. The second step of the proposed approach consists of reducing the dimensionality of the data using PCA [76]: Specifically, the authors reduce the dimensionality of the input data from 111 features to 2 features; no experiments were performed with other dimensions. Using manually defined difficulties as the gold standard, the authors report an accuracy of 73.5%, with a random baseline of 33.3% (three difficulty levels).
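The feature-reduction and classification steps of this approach map naturally onto a scikit-learn pipeline; the following sketch uses a handful of invented feature values (the actual approach uses 111 features, including 100 word2vec dimensions, and the standardization step is an assumption).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: one row per word, with columns such as word length,
# corpus frequency, age of acquisition, concreteness rating, number of vowels.
X = np.array([
    [3, 5000, 3.2, 4.8, 1],
    [11, 12, 10.5, 2.1, 4],
    [7, 300, 7.0, 3.3, 3],
    [4, 2500, 4.1, 4.5, 2],
])
y = [0, 2, 1, 0]  # manually defined difficulty levels (three classes)

# Standardize, reduce to 2 principal components (as in the paper), and classify.
clf = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
clf.fit(X, y)
print(clf.predict([[9, 40, 9.0, 2.5, 3]]))
```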

The most recent paper in this section, Reference [88] (2020), focuses exclusively on Yes/No items. The proposed model uses three groups of features: (i) character length of the target word, (ii) corpus frequency, and (iii) the “Fisher score.” While character length is straightforward to calculate, corpus frequencies can only be obtained for real words, whereas the pseudo-words found in Yes/No items inherently do not occur in corpora and therefore have no frequency value. Therefore, the authors propose a character-level Markov chain language model to compute the log-likelihood of a word (or pseudo-word) and use this as a feature instead of the corpus frequency; this language model is trained on the OpenSubtitles corpus [63]. Last, the Fisher score of a word is a vector representing the gradient of its log-likelihood under the language model (conceptually similar to trigrams weighted by TF-IDF [28]). The authors experiment both with a linear regression model and with a weighted softmax and observe that the former appears to overfit the training data. The weighted softmax does not overfit and leads to estimated difficulties that are in agreement with expert judgments. They also find that the Fisher score features are the most useful for QDET, while character length has little impact.
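The character-level language-model feature can be sketched as follows, using a bigram Markov chain with add-one smoothing and a toy training corpus; the original work also derives Fisher-score vectors from the model, which are omitted here.

```python
import math
from collections import Counter, defaultdict

def train_char_bigram_lm(corpus_words):
    """Character-level bigram counts, with word-boundary markers."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        chars = ["<s>"] + list(word) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts

def log_likelihood(word, counts, alphabet_size=28):
    """Add-one smoothed log-likelihood of a (pseudo-)word under the bigram model."""
    chars = ["<s>"] + list(word) + ["</s>"]
    ll = 0.0
    for prev, cur in zip(chars, chars[1:]):
        total = sum(counts[prev].values())
        ll += math.log((counts[prev][cur] + 1) / (total + alphabet_size))
    return ll

counts = train_char_bigram_lm(["house", "mouse", "horse", "hose"])  # toy corpus
print(log_likelihood("mouse", counts))  # plausible character sequence -> higher value
print(log_likelihood("zxqw", counts))   # implausible pseudo-word -> much lower value
```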

5.4 Sentence Knowledge Questions

Knowledge questions that are presented to students in the form of one or more sentences can be divided into (i) reduced redundancy testing, (ii) grammar questions, and (iii) vocabulary questions.

Reduced redundancy testing [93] is based on the idea that natural language can be redundant thanks to contextual cues, and more advanced learners can be distinguished from beginners by their ability to deal with reduced redundancy. In language testing, a standard approach to reducing redundancy is the cloze test (Figure 4(a)), which consists of removing some words from a text and asking the learner to fill the gaps. Similar approaches are c-tests (Figure 4(b)), which have more gaps but provide the first half of the words as a hint [25], and prefix deletion tests (Figure 4(c)), where the first half of the word is masked and the second part is used as a hint. In this survey, we present seven papers that dealt with the task of QDET for reduced redundancy testing [6, 31, 44, 47, 59, 88, 104].

Fig. 4. Examples of sentence knowledge questions.

Grammar questions aim at assessing the grammar knowledge of students rather than their vocabulary breadth. Two papers dealt with QDET for grammar questions in recent years [47, 74], focusing on Cued Gap-Filling Items (CGFI), where learners read a short text and fill in the gap(s) using cues consisting of a single word that must be transformed to fit the context, as shown in Figure 4(d).

Even though vocabulary questions are generally single words, as discussed in Section 5.3, in some cases they are made of whole sentences. That is the case for the Closest In Meaning (CIM) questions considered in References [97, 98, 99]. As shown in Figure 4(e), students are given a text passage and are asked to pick, from a set of possible choices, the word that is closest in meaning to a word highlighted in the text.

5.4.1 Reduced Redundancy Testing.

QDET for reduced redundancy testing has received a fair amount of research attention, and the proposed approaches have different levels of complexity; an overview is shown in Table 3.

Paper | Year | Uses sentence(s) | Uses gap word(s) | Approach | Question format
[47] | 2018 | - | ✓ | Considers word difficulty as question difficulty and obtains it from a manually curated table containing the difficulty of 6,480 words. | cloze
[44] | 2019 | ✓ | - | Linear regression model that uses as features mean token length and mean sentence length. | cloze
[104] | 2017 | ✓ | ✓ | Linear regression model that uses as features 25 linguistic variables at passage and item level. | cloze
[31] | 2019 | ✓ | - | Shannon’s entropy is used to assign a score to each gap based on the number of candidate words that could fill the gap given the context; the score is used as a direct indication of question difficulty. | cloze
[6] | 2015 | ✓ | ✓ | SVM that uses 70 features from (i) the difficulty of the passage, (ii) the difficulty of the target word, and (iii) test parameters. | prefix deletion, cloze, c-tests
[59] | 2019 | ✓ | ✓ | SVM that uses 59 features reduced from the 70 in Reference [6]. | c-tests
[88] | 2020 | ✓ | ✓ | Linear regression model that uses as features: average word length, sentence length, log-likelihood from a language model, and Fisher score. | cloze

Table 3. Overview of the Approaches Proposed for QDET in Reduced Redundancy Testing

The approach that is arguably the least complex is not based on any machine learning technique, and was proposed by Huang et al. in 2018 [47], in the same paper that deals with grammar questions and reading comprehension questions. The authors claim that the difficulty of cloze items is determined by the difficulty of the correct answer. To estimate word difficulty, the authors use a graded word list made by an educational organization, the College Entrance Examination Center of Taiwan, which contains 6,480 words in English divided into six levels of complexity. For QDET, the authors simply use the word difficulty from the aforementioned list and observe that higher difficulty generally corresponds to lower correctness of students’ answers, without evaluating any baseline.

Another approach was proposed in 2019 [44], in a paper that does not have QDET as its final goal but still uses textual content for question calibration, specifically dealing with cloze items. The proposed approach is very simple, in that it does not use any information about the gap but only the reading complexity of the passage; in a sense, it might be considered the opposite of the approach in Reference [47]. Specifically, it uses the mean token length and the mean sentence length of the textual passage to estimate question difficulty with a linear regression model. The ground truth difficulty is manually defined by human experts, and there are two possible levels. Preliminary results presented in the paper show that, even though the chosen approach is arguably simple and cannot distinguish between different questions coming from the same textual passage, there is a positive correlation between the difficulty estimated with the proposed approach and the results observed in a test context; in this case as well, the authors do not perform a comparison with previous approaches.

Another paper addressing QDET of cloze items using only information from the text passage is Reference [31], which performs a pilot study of an entropy-based approach to estimate the difficulty. Specifically, the authors build on the assumption that the complexity of a gap is related to the number of possible answers determined by the surrounding context and the likelihood of each answer. In practice, they use Shannon’s entropy [90] to assign a score to each gap based on the number of valid words that could fill the slot given the surrounding context. As a result, gaps with many possible answers yield higher entropy than those with fewer answers. The authors compute the entropy using a 5-gram language model trained on the 1 Billion Word WMT 2011 News Crawl corpus with KenLM [41], considering only the 100 most probable words when computing the entropy of each gap (the complete vocabulary has more than 82,200 words). Using the CEFR levels of the exams as difficulty gold standard, the authors study the correlation between the difficulty level and the entropy and observe that higher difficulty levels indeed seem to correspond to greater entropy.
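The entropy computation itself is straightforward once a language model can score the candidate fills; the sketch below uses made-up log-probabilities in place of the 5-gram KenLM scores used in the paper.

```python
import math

def gap_entropy(candidate_log_probs: list[float]) -> float:
    """Shannon entropy of a gap, computed over the renormalized probabilities of
    the top candidate words that could fill the slot given its context."""
    probs = [math.exp(lp) for lp in candidate_log_probs]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Hypothetical log-probabilities of the 5 most likely fillers for two gaps
# (the paper uses the 100 most probable words under the 5-gram model).
predictable_gap = [-0.1, -4.0, -4.5, -5.0, -5.5]  # one dominant candidate
ambiguous_gap = [-1.6, -1.6, -1.6, -1.6, -1.6]    # equally likely candidates

print(gap_entropy(predictable_gap))  # low entropy -> easier gap
print(gap_entropy(ambiguous_gap))    # close to log2(5), about 2.32 -> harder gap
```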

Trace et al. in Reference [104] study which features affect the difficulty of cloze items and perform a regression analysis to observe the correlation between item difficulty and such features. Specifically, the authors consider 25 linguistic variables at both passage level and item level (mostly related to the number of words, sentences, and syllables, and to word frequency). They find that both passage-level and item-level features are helpful for QDET and observe that three features accounted for 24% of the total variance of item difficulty: (i) the frequency of the item elsewhere in the items, (ii) the number of syllables per word, and (iii) the number of sentences per 100 words in the passage. Also, using PCA to reduce the original 25 features to 6 features, they observe that the reduced features do not lead to improved accuracy and that a majority of the variance in item difficulty remains unexplained.

The first paper addressing not only cloze tests but also c-tests and prefix deletion tests is Reference [6], which extended previous work [5] and proposed a technique for QDET that is applicable to all three test types. Specifically, the proposed approach performs QDET with a SVM regression model using 70 features related to (i) the reading difficulty of the text passage, (ii) the difficulty of the target word (obtained from a predefined table), and (iii) test parameters. The experiments showed a positive correlation between the selected features and the ground truth difficulty, but no baselines are evaluated.

Taking inspiration from References [5, 6], in Reference [59] the authors proposed a technique to modify the difficulty of c-tests by varying the number and position of the gaps. As for QDET, they evaluate a model similar to the one proposed in Reference [6], extracting, from the original set of 70 features, 59 features related to (i) item dependency, (ii) candidate ambiguity, (iii) word difficulty, and (iv) text difficulty. The regression is still performed with an SVM. They evaluate the model both on the same data as References [5, 6] and on a new private dataset, and obtain results in agreement with previous research. Additionally, the authors experiment with neural models for the regression component but observe that they are outperformed by the SVM on both datasets, although the difference is small. Specifically, the SVM reaches an RMSE of 0.24 and 0.21 on the two datasets, while a Fully Connected Neural Network (FCNN) reaches 0.25 and 0.22 and a Bidirectional LSTM reaches 0.24 and 0.24. The ground truth difficulties are real numbers in the range \( [0; 1] \).

Last, Reference [88] proposes a linear regression model for QDET of cloze tests. The proposed model is exactly the same that is used for elicited speech and dictation items (presented in Section 5.5) and very similar to the one presented in the same paper for single word vocabulary questions. Indeed, it uses as features (i) the average word length, (ii) the sentence length, (iii) log-likelihood obtained from a word-level unigram language model, and (iv) Fisher score features. The authors evaluate the model using AUC and the CEFR level of English cloze tests as gold standard and observe that all features are helpful for difficulty estimation. Additionally, with an ablation study, they find that the Fisher score has the biggest impact on the estimation (as in the case of single word vocabulary questions).

5.4.2 Grammar Questions.

An overview of the two approaches recently proposed is presented in Table 4.

Paper | Year | Uses sentence(s) | Uses gap word(s) | Approach | Question format
[47] | 2018 | - | ✓ | Uses a table containing 44 pre-evaluated grammar patterns of known difficulty; the difficulty of the question is the difficulty of the corresponding pattern. | CGFI
[74] | 2019 | ✓ | ✓ | Ridge regression, using 36 features from gap and context. | CGFI

Table 4. Overview of the Approaches Proposed for Estimating the Difficulty of Grammar Questions

In Reference [47], the authors assume that the difficulty of a grammar question is determined by the difficulty of the grammar pattern of the correct answer. They identify 44 grammar patterns and estimate the difficulty of each one of them observing their rate of occurrence in English textbooks of different grade levels (assuming that the difficulty of the grammar pattern depends on the grade level of the textbook in which it frequently appears). Then, difficulty is estimated by parsing the question to identify its grammar pattern and searching the table for the corresponding difficulty.

The other paper that focused on grammar questions [74] adopted a traditional machine learning approach. Specifically, the proposed approach consists of computing 99 features at (i) gap level (54), (ii) item level (18), and (iii) context level (27), and using a ridge regression algorithm for difficulty estimation. Some features were directly extracted by the authors, while others were obtained from publicly available tools. The authors experiment with several configurations (i.e., different subsets of the 99 features) and observe that the best results are not obtained using all of them. Indeed, the best configuration—RMSE of 0.75 with a ZeroR baseline of 1.78—leverages only 56 features. These features are selected via recursive feature elimination, which consists of recursively eliminating the least influential features. The full model (99 features) reaches an RMSE of 0.78. However, as the final model, the authors propose an even smaller model, again obtained with recursive feature elimination, which achieves an RMSE of 0.77 using only 36 features (26 gap-level features, 4 context features, and 6 item-level features). Some of the most important features for the estimation are: (i) the tense of the verb (e.g., simple present, simple past); (ii) the presence of forms such as “used to” or “was going to” and adverbs; (iii) the word order; (iv) the word frequency (from References [43] and [13]); (v) the word length; (vi) the age of acquisition [56]; and (vii) the concreteness [14] of the words appearing in the question.
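In scikit-learn terms, the combination of ridge regression and recursive feature elimination described above corresponds to something like the following sketch, with a random feature matrix standing in for the 99 gap-, item-, and context-level features and an arbitrary regularization strength.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 99))     # 200 items x 99 hand-crafted features (random stand-in)
y = rng.normal(loc=2.0, size=200)  # item difficulties (toy values)

# Recursively eliminate the least influential features until 36 remain,
# then fit the ridge regressor on the selected subset.
selector = RFE(estimator=Ridge(alpha=1.0), n_features_to_select=36, step=1)
selector.fit(X, y)

print("number of selected features:", int(selector.support_.sum()))
print("predicted difficulties:", np.round(selector.predict(X[:5]), 2))
```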

5.4.3 Vocabulary.

Three papers have dealt with CIM questions [97, 98, 99]; an overview is presented in Table 5.

Paper | Year | Approach | Question format
[97] | 2017 | 10 features from target word, reading passage, correct answer, and distractors. | CIM
[99] | 2019 | Features are reading passage difficulty, similarity between correct answer and distractors, and distractor word difficulty level. Two levels (low/high) for each of them, the number of “low” features represents the difficulty (from 0 to 3). | CIM
[98] | 2020 | Features are target word difficulty, similarity between correct answer and distractors, and distractor word difficulty level. Two levels (low/high) for each of them, the number of “low” features represents the difficulty (from 0 to 3). | CIM
  • All proposed approaches leverage the text of the passage, the correct choice, and the distractors.

Table 5. Overview of the Approaches Proposed for Estimating the Difficulty of Closest-in-meaning Questions


The first paper [97] is a study that investigates the relations between several factors of question items in English CIM tests and the corresponding item difficulty. Specifically, the authors consider 10 features obtained from four elements: (i) the target word, (ii) the reading passage, (iii) the correct answer, and (iv) the distractors. Most of these features (9 out of 10) are related to the word difficulty of the different elements (e.g., average word difficulty of the words in the reading passage), and the other feature is the number of word senses of the target word. The “word difficulty” is obtained from JACET 8000 [105], which is a list of 8,000 words grouped in difficulty levels, specifically built for Japanese learners of English. The experimental results show that the number of word senses does not correlate well with the difficulty, probably, because generally each word has one meaning that is much more frequent than the others. Considering the other features, the ones that correlate more with question difficulty are (i) the difficulty of the target word, (ii) the average word difficulty of the correct answer, and (iii) the average word difficulty of the distractors.

In Reference [99], the authors explore how three factors—related to the features mentioned above—can be leveraged to control the difficulty of CIM questions. The three factors are (i) reading passage difficulty, (ii) similarity between the correct answer and the distractors, and (iii) distractor word difficulty level. For each of these factors, the authors only consider two levels (high and low), and the combination of levels is finally used for the task of QDET. For reading difficulty, the authors apply three well-established readability formulas to documents from two sources: Times in Plain English to represent lower-complexity English, and the New York Times to represent higher-complexity English. The readability formulas used in this study are the Flesch-Kincaid Grade Level, the Flesch-Kincaid Reading Ease [54], and the Dale-Chall readability formula [16]. The average values obtained for the two levels of English are then used as reference values when performing QDET. For the similarity between the correct answer and the distractors, the authors use cosine similarity on the vectors representing the words; these vectors correspond to the frequency of the co-occurring words within a certain window in the corpus. Finally, for the distractor word difficulty level, the authors use JACET 8000. As for the estimation of question difficulty, the authors use the aforementioned factors to obtain four levels of difficulty (corresponding to the number of “high” factors in the question), from “LLL” to “HHH.” To evaluate the approach for QDET, the authors observe the correctness of students’ answers for questions of different difficulty levels and note that there is indeed a positive correlation between the difficulty estimated by the model and the fraction of wrong answers.
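The readability side of these factors relies on standard deterministic formulas; for instance, the Flesch-Kincaid Grade Level and the Flesch Reading Ease can be computed from word, sentence, and syllable counts, as in the sketch below (with a deliberately naive syllable counter; the Dale-Chall formula additionally requires its word list and is omitted).

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable count: number of vowel groups (illustrative only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid Grade Level, Flesch Reading Ease) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # words per sentence
    spw = syllables / len(words)       # syllables per word
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    return grade_level, reading_ease

simple_text = "The cat sat on the mat. It was warm."
complex_text = "Notwithstanding considerable meteorological uncertainty, the expedition persevered."
print(flesch_kincaid(simple_text))   # low grade level, high reading ease
print(flesch_kincaid(complex_text))  # higher grade level, lower reading ease
```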

The latest work, Reference [98] (2020), is an extension of Reference [99]. Indeed, the authors have the same target of controlling item difficulty but use slightly different features. The factors taken into consideration are (i) target word difficulty, (ii) similarity between the correct answer and the distractors, and (iii) distractor word difficulty level. As before, for each factor two levels (high and low) are considered, and the question difficulty is obtained from the combination of such levels (i.e., four levels). Again, for the target word difficulty and the distractor word difficulty level, the authors use JACET 8000, while the approach for computing the similarity is different from before and arguably more advanced. Indeed, the authors use GloVe [77] embeddings for calculating the cosine similarity. The experimental results support this choice and show that this approach yields more accurate QDET than the previous one.

5.5 Elicited Speech and Dictation Items

Elicited speech and dictation are two tasks that are very rare in the literature on QDET and, indeed, there is only one recent work of relevance here [88]. The elicited speech task taps into reading and speaking skills by requiring examinees to produce a sentence out loud, while the dictation task requires the examinees to transcribe an audio recording, thus measuring both listening and writing skills. The proposed approach is no different for these two tasks and is the same as the one used for cloze tests (as presented in Section 5.4). The difficulty ranking is performed with a linear model that uses features from different sources: (i) average word length, (ii) sentence length, and (iii) log-likelihood and (iv) Fisher score features from a word-level unigram language model. Therefore, the proposed approach does not really dig deep into the characteristics that are peculiar to this type of question, such as the audio in the dictation task or the number of vowels in each word in the elicited speech task, and these are certainly promising areas of focus for future research.


6 CONTENT KNOWLEDGE ASSESSMENT

All the approaches proposed in the domain of content knowledge assessment focus on knowledge questions and can be categorized into (i) text only questions, whose content is only text, and (ii) heterogeneous questions, which contain information of other types such as images and tables. In Section 6.1, we present the approaches proposed for text only questions, which are by far the most numerous, and in Section 6.2 the approaches proposed for heterogeneous questions. Last, in Section 6.3, we describe two approaches that can be used only in very specific scenarios and therefore do not fit neatly into either of the previous categories.

6.1 Text Only Questions

Text only questions (an example can be seen in Figure 5) have received the most attention in recent years, and in Figure 6, we show the categorization of the proposed approaches depending on the information they leverage and how they use it.

Fig. 5. Example of text only question—a MCQ of a medical exam—from Reference [114].

Fig. 6. Categorization of the approaches proposed for text only questions.

Some approaches use only the text of the questions [8, 9, 27], while others also use some additional information that is not part of the questions themselves. The approaches based on question text alone can be applied in the widest range of scenarios and have the fewest constraints; however, the fact that they cannot leverage additional information might be a limitation when such information is available. The works that leverage some kind of additional information can be divided into models that leverage texts from other sources (e.g., books, lectures) and models that leverage non-textual information. As for the texts, they can be used either as data for pre-training a neural model [45, 122] or as data that is necessary for the implementation of the model [81, 114, 115]; this is an important difference, since the need for additional corpora can limit the applicability of some models to scenarios other than the ones they were built for. Last, the non-textual information leveraged by some models can come from different sources: knowledge graphs [29, 55, 89, 107], students’ interactions and knowledge components [21, 102], and response times [113].

6.1.1 Models that Leverage Question Text Only.

An overview of the proposed approaches is shown in Table 6.

Paper | Year | Features | ML model
[27] | 2017 | Features from Coh-Metrix grouped in narrativity, syntactic simplicity, word concreteness, referential cohesion, deep cohesion | Linear regression
[9] | 2020 | TF-IDF | Random forest regressor
[8] | 2020 | TF-IDF, linguistic features, readability measures | Random forest regressor

Table 6. Overview of the Approaches Proposed for QDET of Textual Questions that Use Only Question Text

The first work [27] studied the correlation between question difficulty and several features obtained from the text of the questions, focusing on science tests from exams administered to 11-year-olds in England and Wales. Specifically, the authors experiment with over a hundred linguistic indicators generated using the Coh-Metrix software [35] and categorized into five dimensions: (i) narrativity: the extent to which the item uses language comparable to everyday language; (ii) syntactic simplicity: the degree to which the item is concise and makes use of simple and familiar syntactic structures; (iii) word concreteness: the degree to which the vocabulary used is concrete and meaningful; (iv) referential cohesion: the degree of overlap of words and ideas across sentences, forming explicit connections; (v) deep cohesion: the extent to which the item contains causal and intentional connectives that help the reader build connections and understand relationships and processes in the text. Additionally, the authors experiment with sentence length (number of words per sentence) and paragraph length (number of sentences per paragraph) from the descriptive statistics generated by Coh-Metrix. A linear regression model is used to estimate the difficulty of the 216 items of the experimental dataset from the aforementioned indicators, and the authors observe that the language variables do not correlate strongly with item difficulty, suggesting that the features from Coh-Metrix are not helpful for QDET.

A more recent approach is R2DE (Regressor for Difficulty and Discrimination Estimation) [9], which estimates both the difficulty and the discrimination, as defined in IRT, of MCQ using as input only the text of the questions and the text of the possible choices (both the correct answer and distractors). Experimenting with different input configurations, the authors observed that using the possible answer choices (both the correct answer and the distractors) improves the accuracy of QDET, which is in agreement with other research [45]. From a high-level perspective, R2DE is made of two parallel pipelines: one for QDET and the other for discrimination estimation from text. The two pipelines are architecturally the same but the learned parameters are different; thus, here we focus only on the one that is used for QDET. R2DE first encodes the questions into feature arrays using TF-IDF, removing the tokens that are either too frequent or too infrequent in the corpus; then, it uses these feature arrays as input to a Random Forest regression model, which performs the actual estimation of question difficulty. The authors experiment with other regressors as well, and the Random Forest seemingly outperforms Decision Trees, Linear Regression, and SVM (which is a close second).
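To make this kind of pipeline concrete, the following minimal sketch (in Python, with scikit-learn) reproduces the general shape of a TF-IDF plus Random Forest regressor; the example texts, difficulty values, frequency cut-offs, and hyperparameters are illustrative assumptions, not the ones used in R2DE.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Each training item: question stem concatenated with its answer options,
# paired with a difficulty value (e.g., an IRT difficulty estimated via pretesting).
train_texts = [
    "What is the capital of France? Paris London Madrid Rome",
    "Which planet is known as the Red Planet? Mars Venus Jupiter Saturn",
]
train_difficulties = [-0.8, -0.5]  # illustrative IRT-like values

# min_df/max_df drop tokens that are too rare or too frequent in the corpus;
# the exact thresholds here are placeholders, not the values used in R2DE.
model = make_pipeline(
    TfidfVectorizer(min_df=1, max_df=0.95),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(train_texts, train_difficulties)

print(model.predict(["Which gas do plants absorb during photosynthesis? Oxygen Carbon dioxide Nitrogen Helium"]))
```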

An improvement over R2DE was proposed in Reference [8]: In addition to TF-IDF, the authors use readability indexes and linguistic features for the task of QDET of MCQ. Specifically, the authors use the following readability indexes: Flesch reading ease [32], Flesch-Kincaid Grade level [54], Automated Readability Index [86], Gunning FOG Index [37], Coleman-Liau Index [22], and the SMOG Index [68], which are all computed with deterministic formulas from measures such as the number of words and the average word length. As for the linguistic features, the authors compute nine features related to the number and length of words and sentences in the question and in the answer choices. This improved version still relies on Random Forests for the regression and outperforms the original R2DE model, thus showing that different kinds of features can bring different perspectives to QDET and therefore improve its accuracy. Specifically, the authors report a reduction in RMSE from 0.807 to 0.753, against a ZeroR baseline of 0.820. Also, by presenting the results of an ablation study, they show that all groups of features are helpful for the task of QDET, and the ones that bring the most information are the TF-IDF-based features.
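The sketch below illustrates how deterministic readability and length features can be stacked with TF-IDF features before the regression; the Flesch reading ease formula is the standard one, but the syllable counter is a crude heuristic and the data and feature set are illustrative, not the ones used in the paper.

```python
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

def count_syllables(word):
    # Crude vowel-group heuristic; real readability tools use better estimators.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def handcrafted_features(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    n_syll = sum(count_syllables(w) for w in words)
    # Standard Flesch reading ease formula.
    flesch = 206.835 - 1.015 * n_words / n_sents - 84.6 * n_syll / n_words
    return [flesch, n_words, n_words / n_sents, float(np.mean([len(w) for w in words]))]

texts = ["What is the capital of France? Paris London Madrid Rome",
         "Which enzyme catalyses the hydrolysis of starch? Amylase Lipase Pepsin Trypsin"]
difficulties = [-0.8, 0.4]  # illustrative values

tfidf = TfidfVectorizer()
X = hstack([tfidf.fit_transform(texts),
            csr_matrix([handcrafted_features(t) for t in texts])])
RandomForestRegressor(random_state=0).fit(X, difficulties)
```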

6.1.2 Models that Leverage Additional Texts.

An alternative to using only the question text consists of leveraging additional resources such as books or lecture transcripts, with the constraint that such resources must deal with the same topics that are assessed by the questions. This is a crucial difference from language assessment, where the additional resources can be general domain corpora. Six works explored this area, along two different directions. On the one hand, the models proposed in References [7, 45, 122] make use of publicly available pre-trained models and leverage the additional texts only for an additional pre-training; therefore, they can be used for QDET even if such additional texts are not available (although in this case they generally lead to worse performance due to the lack of the additional pre-training). On the other hand, the models proposed in References [81, 114, 115] have internal components that require such additional data and, if that is missing, cannot be implemented without major modifications to the architecture. This may be seen as a limitation, but it sometimes enables such models to extract more information from the additional texts, leading to more accurate QDET. An overview of the proposed approaches is shown in Table 7.

Table 7. Overview of the Approaches that Use Additional Texts Proposed for QDET of Textual Questions

| Paper | Year | Other texts necessary | Approach |
|---|---|---|---|
| [45] | 2018 | No | SVM that uses as features the cosine similarity between the word2vec embeddings of stem, correct choice, and distractors. The additional texts are used to further pre-train the word2vec embeddings. |
| [122] | 2020 | No | BERT; the additional texts are used to further pre-train the language model. |
| [7] | 2021 | No | BERT and DistilBERT; the additional texts are used to further pre-train the language models. |
| [114] | 2019 | Yes | Random Forest that uses as features: word embeddings (word2vec, ELMo), linguistic features, Information Retrieval-based features. The additional texts are required to compute some of the features. |
| [115] | 2020 | Yes | Same as Reference [114]. |
| [81] | 2019 | Yes | Two neural networks, which estimate two components of question difficulty (recall difficulty and confusion difficulty); their estimations are then averaged. The additional texts are used by the “recall” component of the model. |

The first paper that leverages additional textual corpora for the task of QDET is Reference [45], in which the authors propose an approach built for MCQ on social sciences in Chinese. It is made of two steps: first, (i) a word2vec model is used to obtain semantic vectors representing the question, the correct choice, and the distractors, then (ii) the cosine similarities between these vectors are used as input to an SVM classifier that outputs the estimated difficulty. The additional dataset is used to pre-train the word2vec embeddings and, if it is not available, a pretrained word2vec model can be used (likely compromising the accuracy of the model, though). The authors observe that (i) there is a negative correlation between the item difficulty and the similarity between stem and answer (i.e., if a stem is similar to the answer, then the question is easier) and (ii) there is a positive correlation between the item difficulty and the similarity between the correct answer and the distractors (i.e., if the correct answer is similar to the distractors, then the question is more difficult). As for the classification experiments, the proposed model reaches 78% accuracy, with a random baseline of 20% (categorical difficulty on five levels).
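A minimal sketch of this kind of similarity-based pipeline is shown below (using gensim and scikit-learn); the toy corpus, the embedding dimensionality, and the exact feature set are illustrative assumptions and do not reproduce the authors’ configuration.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

# Toy tokenized corpus standing in for the domain-specific texts used for pre-training.
corpus = [["the", "treaty", "ended", "the", "war"],
          ["the", "revolution", "changed", "the", "government"]]
w2v = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

def embed(tokens):
    # Average the word vectors of the tokens that are in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def features(stem, answer, distractors):
    s, a = embed(stem), embed(answer)
    # Stem-answer similarity plus answer-distractor similarities.
    return [cosine(s, a)] + [cosine(a, embed(d)) for d in distractors]

X = [
    features(["the", "treaty", "ended", "the", "war"], ["treaty"],
             [["revolution"], ["government"]]),
    features(["the", "revolution", "changed", "the", "government"], ["revolution"],
             [["treaty"], ["war"]]),
]
y = [1, 3]  # illustrative difficulty levels on a five-point categorical scale
clf = SVC().fit(X, y)
```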

Another approach that leverages additional corpora for pre-training is proposed in Reference [122], a preliminary work about the effects of using multi-task BERT [24] for performing QDET, specifically on English programming questions. The proposed approach (i) starts from the pre-trained BERT model, (ii) further pre-trains it on a corpus of related documents, and finally (iii) fine-tunes it on the task of QDET. The authors model QDET as a binary classification task, and the experimental results show about 76% accuracy (with a random baseline of 50%). A similar approach is evaluated in Reference [7], which experiments with both BERT and DistilBERT [84], observing that the additional pre-training is indeed helpful for the task of QDET, and that BERT consistently outperforms DistilBERT. The authors also observe that the two Transformers outperform R2DE, thanks to the additional pre-training.
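The following sketch outlines this three-step recipe with the Hugging Face transformers library; the checkpoint name, example sentences, and single gradient steps are illustrative, and a real run would use a masked-LM data collator and a full training loop rather than the unmasked single step shown here.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Step (i): start from a public BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Step (ii): further pre-train on domain documents with the (masked) LM objective.
# A real run would mask tokens with DataCollatorForLanguageModeling and iterate
# over many documents; this single unmasked step is only illustrative.
batch = tokenizer(["A for loop iterates over the elements of a sequence."],
                  return_tensors="pt")
mlm(**batch, labels=batch["input_ids"]).loss.backward()
mlm.save_pretrained("bert-domain-adapted")

# Step (iii): fine-tune the domain-adapted encoder for QDET as binary classification
# (a fresh classification head is initialized on top of the adapted encoder).
clf = AutoModelForSequenceClassification.from_pretrained("bert-domain-adapted",
                                                         num_labels=2)
inputs = tokenizer(["Write a function that reverses a singly linked list."],
                   return_tensors="pt")
out = clf(**inputs, labels=torch.tensor([1]))  # 1 = "difficult" (illustrative label)
out.loss.backward()
```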

Three papers have presented approaches that leverage additional textual information and cannot be implemented without it. Two of them (from the same team of researchers) focus on the task of QDET for MCQ in high stakes medical exams [114, 115] and evaluate the same model. The proposed approach is divided into two steps: (i) first, there is a feature engineering phase, in which the input text is converted into feature arrays, then (ii) the feature arrays are used as input to a regression model that performs the actual estimation of difficulty. The features can be categorized into three groups: (i) word embeddings, (ii) other linguistic features, and (iii) Information Retrieval (IR) features. As for the embeddings, the authors use word2vec (300 dimensions) and ELMo (1,024 dimensions), both pretrained on a corpus of about 22M MEDLINE abstracts.7 The linguistic features are a set of about 60 values coming from different sources: lexical features, syntactic features, semantic ambiguity features, readability formulae, cognitively motivated features, word frequency features, and text cohesion features. Last, the IR features are obtained from an automated Question Answering system that is trained to respond to the items by retrieving relevant documents from the MEDLINE corpus, and this is the group of features that cannot be implemented without the additional dataset. The authors experiment with different regression models (Random Forest, Linear Regression, SVM, Gaussian processes, Fully Connected Neural Networks) and observe that random forests are the best performing. The authors perform an ablation study and find that all the features are helpful for the estimation, leading to an RMSE of 22.45 (the ZeroR baseline is 23.65). They also find that the IR features are, on their own, the most useful for the estimation. However, it is apparent that embeddings and linguistic features lead to comparable performance when used individually, which is somewhat surprising given that the embeddings are obtained with models that capture word semantics.

The other approach to QDET using additional texts was proposed in Reference [81], again targeting MCQs in medical exams. The proposed approach is composed of two neural networks; these are used in parallel to compute different components of question difficulty, which are later averaged to obtain a final difficulty score. The first of them, referred to as the Recall Difficulty Module, receives as input (i) a corpus of related medical documents, (ii) the text of the questions, and (iii) the text of the correct choices, and it has the goal of estimating how difficult it is to recall the knowledge assessed by the question; this is the component that cannot be implemented if the additional dataset of related documents is not available. The other component, named the Confusion Difficulty Module, receives as input the stem of the question and the possible choices (both the correct one and the distractors) and has the target of estimating how difficult it is to distinguish between the different choices. Finally, the two components of the difficulty are combined with a weighted average (the weight is learned). The authors observe that, when using only the Recall Difficulty Module or the Confusion Difficulty Module, the error is higher than when using the complete network, although the difference is not great. Specifically, the complete model leads to an RMSE of 0.1311 (difficulty range is \( [0; 1] \)), the recall module only leads to an RMSE of 0.1319, and the confusion module only leads to an RMSE of 0.1321. The authors also compare the proposed model with a baseline similar to R2DE, building a model that creates feature vectors using TF-IDF and performs the regression with an SVM, which achieves an RMSE of 0.1716 and is clearly outperformed.
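A simplified sketch of this two-branch design is shown below (in PyTorch); the dense layers stand in for the paper’s full recall and confusion modules, and the feature dimensions, mixing scheme, and training targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoComponentDifficulty(nn.Module):
    """Sketch of a two-branch difficulty estimator: each branch scores one
    difficulty component and a learned weight blends the two predictions."""

    def __init__(self, dim):
        super().__init__()
        self.recall = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.confusion = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, recall_feats, confusion_feats):
        d_recall = torch.sigmoid(self.recall(recall_feats))
        d_confusion = torch.sigmoid(self.confusion(confusion_feats))
        w = torch.sigmoid(self.alpha)  # keep the mixing weight in (0, 1)
        return w * d_recall + (1 - w) * d_confusion  # difficulty in [0, 1]

model = TwoComponentDifficulty(dim=128)
recall_x = torch.randn(4, 128)     # e.g., question + supporting-document features
confusion_x = torch.randn(4, 128)  # e.g., stem + answer-option features
pred = model(recall_x, confusion_x)
loss = nn.MSELoss()(pred, torch.rand(4, 1))
loss.backward()
```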

6.1.3 Models that Leverage Knowledge Graphs as Additional Information.

Four works perform QDET using a Knowledge Graph (KG, e.g., YAGO2s [96]) as an additional source of information [29, 55, 89, 107]; an overview is presented in Table 8.

Table 8. Overview of the Approaches to QDET that Use Knowledge Graphs as Additional Information

| Paper | Year | Approach |
|---|---|---|
| [107] | 2015 | Difficulty is defined as the similarity between the correct choice and the distractors. |
| [89] | 2017 | Logistic regression model that uses 15 features related to (i) entity salience (a proxy of entity popularity) and (ii) coherence of entity pairs (i.e., their tendency to appear in the same context). |
| [29] | 2018 | Defines difficulty as the inverse of the average popularity of the entities in the question. |
| [55] | 2019 | Difficulty obtained from the confidence and selectivity of the question. |

An important aspect to note is that, in all these papers, the text of the questions (and possibly the text of the choices) is used for Named Entity Recognition (NER) only, to identify the nodes of the KG that are involved in the question. This is a crucial difference from all the other approaches presented in this survey, since the difficulty does not really depend on the verbalization of the question but only on the nodes of the graph and the links between them.

The first of these works [107] estimates the difficulty of an MCQ by observing the similarity between the correct choice and the distractors, assuming that if the distractors are very similar to the correct choice, then students may find it very difficult to answer the question. Named entities are identified in the texts and linked to corresponding nodes in the KG. The authors define a similarity measure—named Label-set Similarity Ratio (LSR)—to represent the similarity between two nodes depending on their position in the KG. The LSR is not symmetric, therefore the authors define the Closeness between two nodes as the average of the LSR in each direction. Finally, the question difficulty is computed as the average of the Closeness values between the correct choices and the distractors and later converted into one of three classes (high, medium, and low difficulty). This means that the difficulty obtained as a final result is not based on a testing theory. For the evaluation, the authors compare the difficulty estimated by the proposed model with the difficulty selected by human domain experts and observe that the proposed model reaches an accuracy of 65.3%, with a random baseline of 33.3%.

In Reference [89], QDET is modeled as a binary classification task and a logistic regression model is used for difficulty estimation. The ground truth difficulty is set by human experts. The model leverages 15 features related to two concepts: (i) entity salience and (ii) coherence of entity pairs. Entity salience is a normalized score that is used as a proxy for an entity’s popularity. The entities come from Wikipedia, and the authors make use of the link structure within Wikipedia to compute entity salience as the relative frequency with which an entry for an entity is linked to from all other entries. The coherence of entity pairs captures the relative tendency of two entities to appear in the same context. The logistic regression model proposed in this article is capable of reaching a validation accuracy of 66.4% (whereas a random baseline achieves 50%) on a set of 500 questions. The authors also perform an ablation study to demonstrate that both the salience-related features and the coherence-related features are useful for difficulty estimation.

The model proposed in Reference [29] makes use of DBpedia [2] for QDET of MCQ. The proposed model works only for questions that can be modeled as triples using entities and relationships in the knowledge base; an example of a triple is: <London> <capital> <United Kingdom>. It makes use of the DBpedia PageRank value [101] as a popularity measure and defines question difficulty as the inverse of the popularity of the triple representing the question (averaging the popularity of the entities in the triple). This difficulty score—which is in the range \( [0; 1] \)—is converted to a binary value using a threshold of 0.5. This approach is arguably simpler than the previously proposed ones, and this is also visible in the experimental results. The authors recruited 50 participants to evaluate the accuracy of the QDET model and asked them to rate the difficulty of the questions marked either high or low by the model. While 84.7% of the participants agreed that the easy questions were in fact easy, only 38.5% agreed that hard questions were difficult (average 61.6%).
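A minimal sketch of this idea is shown below; it assumes entity popularity scores already normalized to \( [0; 1] \) (e.g., rescaled DBpedia PageRank values) and uses one minus the average popularity as a stand-in for the paper’s exact inverse transformation. The scores are invented for illustration.

```python
def triple_difficulty(popularity, threshold=0.5):
    """Difficulty of a <subject, relation, object> question as the inverse of the
    average (normalized) popularity of its entities, binarized with a threshold.
    `popularity` maps each entity to a score assumed to lie in [0, 1]."""
    difficulty = 1.0 - sum(popularity.values()) / len(popularity)
    return difficulty, ("hard" if difficulty >= threshold else "easy")

# Illustrative, made-up popularity scores:
print(triple_difficulty({"London": 0.9, "United_Kingdom": 0.95}))  # low difficulty
print(triple_difficulty({"Thimphu": 0.1, "Bhutan": 0.3}))          # high difficulty
```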

The latest model in this category is Reference [55], in which the authors address the problem of automatic generation of complex, multi-hop questions over knowledge graphs. As part of the generation process, the authors implement a model for QDET, and here we focus only on that component. The proposed model estimates the difficulty from two measures: (i) the confidence and (ii) the selectivity of the question. As for the confidence (\( \text{Con} \)), the authors use that of a NER model, assuming that higher confidence corresponds to lower difficulty. To measure selectivity (\( \text{Sel} \)), the authors query Wikipedia with each mention and use the number of returned hits as an estimation of its selectivity. Finally, question difficulty is computed as \( \text{diff} = (1 + \text{Sel})/(1 + \text{Con}) \), normalized in the closed interval \( [0; 1] \) and converted into a binary value by thresholding. Evaluation was performed on three publicly available datasets (WebQuestionsSP [117], ComplexWebQuestions [100], and PathQuestion [121]) by asking four participants to judge the accuracy of the difficulty level selected by the model; the results are in line with previous research, with an overall accuracy between 60% and 68%; no baselines are evaluated.
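The formula can be made concrete with a few lines of code; note that the paper only states the formula and that the result is normalized, so the min-max rescaling below is one possible choice, and the selectivity is assumed to be pre-normalized to \( [0; 1] \) from the raw Wikipedia hit count.

```python
def question_difficulty(confidence, selectivity, threshold=0.5):
    """diff = (1 + Sel) / (1 + Con), rescaled to [0, 1] and thresholded.
    `confidence` is a NER confidence in [0, 1]; `selectivity` is assumed here
    to be pre-normalized to [0, 1] from the raw hit count."""
    raw = (1.0 + selectivity) / (1.0 + confidence)   # ranges over [0.5, 2]
    diff = (raw - 0.5) / 1.5                         # min-max rescaling to [0, 1]
    return diff, ("hard" if diff >= threshold else "easy")

print(question_difficulty(confidence=0.95, selectivity=0.1))  # low difficulty
print(question_difficulty(confidence=0.30, selectivity=0.8))  # high difficulty
```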

6.1.4 Models that Leverage Students’ Interactions and Knowledge Components.

Two papers [21, 102] proposed approaches for QDET using, as additional information, the knowledge components (i.e., topics) associated with each question and the results of students’ answers; an overview is presented in Table 9.

Table 9. Overview of the Approaches that Use Knowledge Components and Students’ Interactions as Additional Information

| Paper | Year | Approach |
|---|---|---|
| [21] | 2019 | Two components: (i) LSTM that receives the text of the question, (ii) attention-based model that captures relevance between texts and knowledge components. Then, average pooling. |
| [102] | 2020 | Pre-trained BERT to embed questions and TextCNN to perform QDE. |

The fact that such models leverage students’ interactions makes them unusable for QDET on new items, which have no log of interactions available: Indeed, both papers have students’ performance prediction as their final target, which motivates the need for a history of previous answers.

The first of these papers [21] proposed DIRT (Deep Item Response Theory), a model that takes inspiration from IRT for estimating the probability that a given student correctly or wrongly answers a question, but relies on neural networks for the estimation of the IRT latent traits (skill level \( \theta \), difficulty \( b \), and discrimination \( a \)). DIRT is made of three modules: (i) an input module, (ii) a deep diagnosis module, and (iii) a prediction module. For the scope of the current survey, we are interested only in the deep diagnosis module and, specifically, in the component that performs question difficulty estimation; therefore, we will not present the other components of DIRT. The model used for QDET is made of two parts, which estimate the difficulty from two different perspectives. The first one exploits the semantics of question texts for the estimation, which is performed with an LSTM network that receives as input the text of the questions. The second perspective considers the width and depth of knowledge concepts, which is reflected by the relevance between question texts and knowledge concepts. In practice, this is done with an attention mechanism, which captures the relationship between question texts and knowledge concepts. Last, an average pooling operation is performed to obtain the difficulty. Since the final target of the paper is student answer prediction, there are no experiments to directly compare the estimated difficulty with a ground truth value. However, the authors perform an analysis of the correlation between the estimated difficulty and the correctness of students’ answers and observe that the correctness is higher for questions with lower difficulty.

The other approach that leverages students’ interactions and knowledge components for the task of QDET was proposed in Reference [102], which has student answer prediction as its final target. The model proposed for QDET is fairly simple: Indeed, the authors employ a pre-trained BERT model (without the fine-tuning seen in References [7, 122]) for embedding the questions and apply a TextCNN [53] model for QDET. No experiments are performed to directly evaluate QDET.
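A sketch of this kind of architecture is shown below (in PyTorch, with the transformers library); the filter sizes, pooling scheme, and frozen encoder are illustrative choices rather than the exact configuration of the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextCNNHead(nn.Module):
    """TextCNN over a sequence of token embeddings: parallel 1-D convolutions
    with different kernel sizes, max-pooled and concatenated, then a linear head."""

    def __init__(self, emb_dim=768, n_filters=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.out = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, token_embeddings):           # (batch, seq_len, emb_dim)
        x = token_embeddings.transpose(1, 2)       # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # one difficulty score per item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
head = TextCNNHead()

batch = tokenizer(["Solve for x: 2x + 3 = 11"], return_tensors="pt")
with torch.no_grad():                             # BERT kept frozen, as in a
    embeddings = bert(**batch).last_hidden_state  # plain pre-trained encoder
print(head(embeddings).shape)                     # torch.Size([1, 1])
```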

6.1.5 Models that Leverage Response Times.

One paper [113] has proposed a transfer learning-based model for QDET of MCQ in medical exams, using question text and response times as features. The proposed model is made of an ELMo network, pretrained on the One Billion Word Benchmark [17], followed by an encoding layer added to learn the sequential information from the ELMo embeddings; the encoding layer is made of a BiLSTM. A dense layer then follows the encoding layer to convert the feature vectors to the targets through a non-linear combination of the feature vectors’ elements. As for the training target, the model is first trained for response time prediction and later fine-tuned for the task of QDET. The authors experimented with three different ELMo configurations (small, middle, and original) and various input configurations (stem only, options only, and stem and options). The results indicate that transfer learning can improve the prediction of question difficulty when response time prediction is used as the pre-training task, and that difficulty is best predicted when using only the item stem. Specifically, the authors target the difficulty obtained with CTT (in the range \( [0; 100] \)) and obtain, with the best configuration (i.e., ELMo original, only item stem), an RMSE of 23.32. Contrary to the findings from References [8, 9, 45], using the answer options does not increase the performance of the model and actually hinders it, even though the difference is not great (RMSE of 23.43 when using the full item). Also, it is interesting to observe that, even though the best results are obtained with the larger model (i.e., ELMo original, 93.6M parameters), there is no clear correlation between the size of the models and the accuracy of the estimation: Indeed, the errors obtained with ELMo middle (20.8M parameters) are generally larger than the ones obtained with ELMo small (13.6M parameters). This is definitely worth some further exploration, especially considering that the model that leads to the best results (ELMo original) is the only one for which the inner parameters of ELMo are not updated during QDET training.
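The two-stage training scheme can be sketched as follows (in PyTorch); the ELMo embeddings are assumed to be precomputed, and the dimensions, targets, and single gradient steps are purely illustrative.

```python
import torch
import torch.nn as nn

class SeqRegressor(nn.Module):
    """BiLSTM encoder over contextual word embeddings (e.g., precomputed ELMo
    vectors) followed by a dense head that maps to a single target."""

    def __init__(self, emb_dim=1024, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, embeddings):             # (batch, seq_len, emb_dim)
        _, (h, _) = self.encoder(embeddings)   # final hidden states of both directions
        return self.head(torch.cat([h[0], h[1]], dim=1))

model = SeqRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
stems = torch.randn(8, 40, 1024)               # stand-in for precomputed ELMo embeddings

# Stage 1: pre-train the whole network on response-time prediction.
rt_loss = nn.MSELoss()(model(stems), torch.rand(8, 1) * 120)   # seconds (illustrative)
opt.zero_grad()
rt_loss.backward()
opt.step()

# Stage 2: swap the target and fine-tune the same network on question difficulty.
diff_loss = nn.MSELoss()(model(stems), torch.rand(8, 1) * 100)  # CTT difficulty in [0, 100]
opt.zero_grad()
diff_loss.backward()
opt.step()
```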

6.2 Heterogeneous Questions

Considering the publications that deal with heterogeneous questions (an example is shown in Figure 7), they all focus on questions with accompanying images [30, 94, 119] but one of them [94] also focuses on the effects that tables have on question difficulty.

Fig. 7. Example of a medicine question from Reference [30].

An overview is shown in Table 10.

Table 10. Overview of the Approaches Proposed for QDET of Heterogeneous Questions

| Paper | Year | Approach |
|---|---|---|
| [94] | 2016 | Studies how the presence of images, tables, formulas, and some textual features (text length, presence of specialist terms and abstract concepts) affect item difficulty. |
| [30] | 2019 | ResNet for extracting image representations, BERT for embedding textual content. Capsule Neural Network to obtain a fixed-length vector that represents the exercise. Bayesian inference-based softmax regression classifier to perform estimation. |
| [119] | 2019 | (i) Embedding of heterogeneous content (word2vec for texts, convolutional layers for images, fully connected layers for metadata), (ii) BiLSTM, (iii) self-attention, and (iv) max pooling to obtain pre-trained question representations. Fine-tuning on QDE with an FCNN. |

The first work to focus on question images and their effects on the difficulty was Reference [94], which performed a study of how some textual features and the presence of images, tables, and formulas (not their content) affect the IRT difficulty of MCQ in a scientific reasoning exam. Considering textual features, the authors take into consideration text length and the presence of specialist terms and abstract concepts. By studying the correlation between the aforementioned features and the IRT difficulty of the items, the authors observed that the difficulty is significantly increased by the presence of abstract concepts and specialist terms, suggesting that they might be a good predictor of question difficulty. The presence of images also has a positive effect on item difficulty, meaning that items that contain images tend to be harder to solve. This result is in contrast with previous research [60], and the authors claim that this might happen because the images in the experimental dataset are generally used to show complex scientific models, and therefore the increase in difficulty might come from the complexity of those models, not from the images themselves.

Being able to model the content of the images and not only their presence might be very helpful for improving the accuracy of QDET for questions containing images, which is the focus of two papers from 2019. The first of them [30] proposed an approach for predicting the difficulty of visual-textual exercises, using as input both the text and the image of each question. The authors experiment on two datasets, one containing math questions and the other containing medicine questions. The proposed model is made of two modules: (i) a feature extraction module and (ii) a difficulty classifier module. The feature extraction module contains two components: a Residual Network [40] for extracting the representation of the images and a BERT model [24] for embedding the textual content. Since the two vector representations can have different lengths, they are then fed into a Capsule Neural Network [83] to obtain a fixed-length vector that contains the unified representation of the exercise. The fixed-length representation of each exercise is then used as input to the difficulty classifier module, which is a Bayesian inference-based softmax regression classifier and performs the actual estimation of difficulty.
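A simplified sketch of such a visual-textual pipeline is shown below; a linear fusion layer is used here in place of the capsule network described in the paper, and the backbone choices, number of difficulty levels, and input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import AutoModel, AutoTokenizer

class VisualTextualDifficulty(nn.Module):
    """Sketch of a visual-textual difficulty classifier: ResNet features for the
    image, BERT features for the text, a linear fusion layer standing in for the
    paper's capsule network, and a softmax classifier over difficulty levels."""

    def __init__(self, n_levels=3):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()             # keep the 512-d image features
        self.image_encoder = backbone
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.fuse = nn.Linear(512 + 768, 256)
        self.classifier = nn.Linear(256, n_levels)

    def forward(self, image, text_inputs):
        img_feat = self.image_encoder(image)                                  # (B, 512)
        txt_feat = self.text_encoder(**text_inputs).last_hidden_state[:, 0]   # [CLS] vector
        fused = torch.relu(self.fuse(torch.cat([img_feat, txt_feat], dim=1)))
        return self.classifier(fused)           # logits over difficulty levels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualTextualDifficulty()
logits = model(torch.randn(1, 3, 224, 224),
               tokenizer(["Which structure is indicated by the arrow?"],
                         return_tensors="pt"))
print(logits.softmax(dim=1))
```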

In Reference [119], the authors introduced a general pre-training method—namely, QuesNet—to learn question representations that can be fine-tuned for several downstream tasks, one of them being difficulty estimation, similarly to what is done in general purpose pre-trained language models (e.g., BERT). Specifically, the paper focuses on maths MCQ containing images. At a high level, QuesNet is made of three components: (i) an embedding module, (ii) a content module, and (iii) a sentence module. The embedding module projects heterogeneous input content into a unified space, which enables the model to work on inputs from different sources. Specifically, the input can be (i) text from the body of the question, (ii) an image that is part of the question, and (iii) question metadata (e.g., the knowledge components associated with a question). Text embedding is performed with word2vec, image embedding is done with a convolutional neural network, and metadata embedding is performed with a fully connected network. The content module is made of a BiLSTM, which receives the concatenation of the vectors produced by the embedding module. Then, the sentence module leverages self-attention for aggregating the item representation vectors into a sentence representation; this is done with a multi-head attention module to perform global self-attention [106]. Finally, there is a max pooling layer to produce a single vector representing the heterogeneous input question. The proposed architecture is pre-trained with a two-level hierarchical approach: First, a masked language model is used as the objective for learning low-level linguistic features; then, a domain-oriented objective is used for learning high-level domain logic and knowledge. The embedding modules are pre-trained separately: The text embedding is a word2vec model trained on the specific corpus, while the image and metadata embeddings are fully connected neural networks pre-trained using an encoder-decoder architecture and an auto-encoder loss. Once pre-trained, the model can be fine-tuned for specific downstream tasks. Among other tasks, the authors experiment with QDET and do so by adding a fully connected layer on top of the question embeddings.

6.3 Others

There are two works that do not really fall into any of the previous categories: one of them [78] proposed a model for QDET in the case of questions whose answer is a First Order Logic formula, while the other [72] performs QDET leveraging the variance of students’ answers, without looking at the text of the questions.

First Order Logic (FOL) represents knowledge as a collection of objects, their attributes, and the relations among them. In Reference [78], the authors focus on exercises that ask students to convert a natural language sentence into a FOL formula (i.e., the answer must be a FOL formula) and propose an approach that can only be used in this niche. They assume that the overall difficulty of an exercise depends both on the natural language sentence and on the FOL formula; therefore, the proposed approach uses features from both components. Specifically, the textual features are (i) word order matching, (ii) anaphoric connectives, (iii) negating keywords, (iv) special words/phrases, (v) quantifier mismatch, and (vi) connective mismatch; the FOL features are (i) number of quantifiers, (ii) type of quantifiers, (iii) order of quantifiers, (iv) number of implication symbols, and (v) number of different connectives. In practice, for the estimation of the difficulty there is no machine learning involved, and the model simply performs a deterministic mapping from the input features to the output difficulty, according to two predefined tables curated by the authors: An initial estimation is performed using the FOL features only, then this is further adjusted according to the NLP features.
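A toy sketch of such a deterministic mapping is shown below; the lookup tables and adjustment rules are hypothetical and only illustrate the mechanism, not the authors’ curated tables.

```python
# Hypothetical lookup tables in the spirit of the paper's deterministic mapping:
# a base difficulty from FOL-formula features, then an adjustment from textual cues.
FOL_BASE_DIFFICULTY = {
    # (number of quantifiers, number of implication symbols) -> base level
    (1, 0): "easy",
    (1, 1): "medium",
    (2, 1): "hard",
}
TEXT_ADJUSTMENT = {"quantifier_mismatch": +1, "negating_keywords": +1}
LEVELS = ["easy", "medium", "hard"]

def estimate_difficulty(n_quantifiers, n_implications, text_flags):
    base = FOL_BASE_DIFFICULTY.get((n_quantifiers, n_implications), "hard")
    level = LEVELS.index(base)
    level += sum(TEXT_ADJUSTMENT.get(flag, 0) for flag in text_flags)
    return LEVELS[min(level, len(LEVELS) - 1)]

# "Every student reads some book": two quantifiers, one implication, no tricky wording.
print(estimate_difficulty(2, 1, []))                      # hard
print(estimate_difficulty(1, 0, ["negating_keywords"]))   # easy -> medium
```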

In Reference [72], the author observed that question difficulty can be approximated by the amount of variation in students’ answers, and this amount of variation can be computed before grading. The paper deals with questions from a corpus of computer science short answers in German [73], which require the student to freely formulate one- to three-sentence answers. The author models the variance of student answers through the Greedy String Tiling similarity measure [111] (which ranges between 0 and 1) and measures both the variation between students’ answers and the variation with regard to a reference answer created by a domain expert. It is observed that the variation of answers among themselves is a much stronger predictor than the variation with regard to the reference answer. The author also experiments with predicting question difficulty using the levels from Bloom’s taxonomy [11] and observes that, when available, they are better predictors of question difficulty; however, such levels are not always available and have to be manually selected, therefore the answer variation is a good alternative. The results obtained in this article are very relevant but were obtained by experimenting on only 25 questions; therefore, further research is needed to better evaluate the possibility of performing QDET using student answer variation. Also, it would be interesting to evaluate a similar approach on questions of a different nature, where the variation resides, for instance, in the choices picked in an MCQ.
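The general idea can be sketched as follows; difflib’s SequenceMatcher is used here as a simple stand-in for Greedy String Tiling, and the example answers are invented.

```python
from itertools import combinations
from difflib import SequenceMatcher

def answer_variation(student_answers):
    """Mean pairwise dissimilarity (1 - similarity) between free-text answers.
    SequenceMatcher is a simple stand-in for Greedy String Tiling."""
    pairs = list(combinations(student_answers, 2))
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)

easy_q = ["a stack is last in first out",
          "a stack is last in, first out",
          "last in first out data structure"]
hard_q = ["it depends on the scheduler",
          "round robin shares the cpu fairly",
          "processes are preempted after a quantum"]

print(answer_variation(easy_q))  # lower variation -> presumably an easier question
print(answer_variation(hard_q))  # higher variation -> presumably a harder question
```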

7 FURTHER ANALYSIS

In this section, we perform an additional analysis of the papers presented in this survey, focusing on different dimensions with respect to the ones discussed so far. Specifically, we focus on a categorization by features and algorithm (Section 7.1), on evaluation and reproducibility (Section 7.2), on learning theories and difficulty format (Section 7.3), on natural language (Section 7.4), and on the number of publications per year and venue (Section 7.5). Finally, we collect all the aforementioned information and present it in a single table (Table 13).

7.1 Features and Machine Learning Algorithm

The approaches proposed in previous research are very diverse, and they use different features and machine learning algorithms. To provide an overview of their usage in the literature and their popularity, in Table 11, we list several features and algorithms and the papers that use them. This list of features and algorithms is not comprehensive, as we report here only the ones that are shared across different papers; for a recap of all the models, we refer to Section 8 and, specifically, to Tables 14 and 15.

Table 11. List of Papers that Use Specific Features or Machine Learning Algorithms for QDET

| Algorithm/Feature | Papers |
|---|---|
| Word difficulty | [6, 59, 98, 99] |
| Reading difficulty or readability indexes | [6, 8, 47, 59, 99] |
| Word length, passage length, or similar linguistic features | [8, 10, 23, 44, 88, 94, 104, 114, 115, 116] |
| Word frequency | [23, 116] |
| TF-IDF | [8, 9] |
| Word2vec embedding | [26, 45, 48, 61, 114, 115, 116, 119] |
| ELMo embedding | [113, 114, 115] |
| Attention-based Neural Network | [21, 30, 48, 102, 119, 122] |
| Fully Connected Neural Network | [48, 61, 113, 119] |
| LSTM or BiLSTM | [21, 61, 113, 119] |
| BERT | [30, 102, 122] |
| Convolutional Neural Network | [48, 102, 119] |
| Random Forest | [8, 9, 67, 114, 115] |
| SVM | [6, 26, 45, 59, 116] |
| Linear Regression | [27, 44, 104] |
| Similarity measures | [45, 98, 99, 107] |

Note: The list is not comprehensive, as it does not include approaches (e.g., entropy, ResNet) that are used in single papers and not shared between them.

7.2 Evaluation and Reproducibility

Performing a quantitative comparison of the models proposed for QDET is extremely challenging, as almost every paper works in its own silo. First, they deal with different educational domains (i.e., Language Assessment, Content Knowledge Assessment, and various subdomains within the two). Second, the question format is diverse in different studies (e.g., MCQ and open questions, text only questions and heterogeneous questions), meaning that the models cannot be directly transferred between scenarios. Also, the learning theories used in each paper, and therefore the format and definition of the difficulty itself, can be diverse.

Moreover, due to the significant value held by educational data and exam-security concerns, most of the papers do not publicly share the experimental dataset nor the code used for the implementation, which makes a quantitative comparison even more challenging. Only Reference [26] shared the complete dataset used in the paper, and References [10, 89, 107] shared a portion of the experimental dataset. As for the code, six papers publicly shared the implementation of their models [8, 9, 55, 59, 115, 119], but none made their trained models available; thus, there will inevitably be some degrees of freedom in reproducing the work, due to differences in data processing, model parameters, and computing resources.

As for the experimental setup, QDET is generally evaluated by comparing the estimated difficulty with a target value, as is common practice in supervised machine learning. However, some papers evaluate QDET using Students’ Answers Prediction (SAP): In practice, this consists of using question difficulty to predict the correctness of a student’s answer and comparing the predicted correctness with the observed outcome. Table 12 lists the metrics that have been used for evaluation. It is clear that, in the papers that model QDET as a regression task, Root Mean Squared Error and Pearson’s Correlation Coefficient are by far the most common ones, but the choice is not agreed upon in the research community, as many papers use different metrics. Considering the papers that model difficulty as a discrete variable, Accuracy is by far the most commonly used metric, but again a variety of metrics are used. In particular, it is interesting to note that only a few papers use metrics that are not affected by the class imbalance often found in the evaluation datasets, which is a problem worth addressing, for instance by avoiding the use of accuracy in favor of more robust metrics.

Table 12. List of Evaluation Metrics Used in the Papers Presented in This Survey

| Metric | Papers |
|---|---|
| *Papers that model difficulty as a continuous variable* | |
| Pearson’s Correlation Coefficient | [6, 23, 31, 44, 48, 59, 67, 74, 88, 97, 104, 119] |
| Root Mean Squared Error | [8, 9, 26, 48, 59, 74, 81, 113, 114, 115, 119] |
| Mean Absolute Error | [8, 9, 114, 119] |
| R squared (R2) | [72, 97] |
| Degree of Agreement | [48, 119] |
| F1 score | [115] |
| Mean Squared Error | [9] |
| Quadratic Weighted Kappa | [59] |
| Passing Rate | [48] |
| Spearman rank Correlation Coefficient | [81] |
| Kendall rank Correlation Coefficient | [81] |
| Area Under the ROC Curve on SAP | [21, 102] |
| Accuracy on SAP | [21] |
| *Papers that model difficulty as a discrete variable* | |
| Accuracy | [29, 30, 45, 55, 89, 116, 122] |
| F1 score | [10, 78, 122] |
| Pearson’s Correlation Coefficient | [98, 99] |
| Confusion Matrix | [45, 61] |
| Accuracy on SAP | [47] |
| Precision | [78] |
| Recall | [78] |
| Fleiss Kappa | [89] |

Table 13. An Overview of All the Approaches Proposed in Recent Years, Showing for Each of Them the Educational Domain (Content Knowledge Assessment or Language Assessment), the Format of the Difficulty (Continuous or Discrete), the Testing Theory, the Additional Features Leveraged by the Model, and the Natural Language under Study

| Paper | Year | Educational domain | Difficulty format | Testing theory | Additional features | Natural language |
|---|---|---|---|---|---|---|
| [6] | 2015 | LA | C | CTT | Predefined tables | En, Fr, De |
| [8] | 2020 | CKA | C | IRT | - | En |
| [9] | 2020 | CKA | C | IRT | - | En |
| [10] | 2021 | LA | D | Oth. | - | En |
| [21] | 2019 | CKA | C | Oth. | Q-matrix, test logs | En |
| [23] | 2015 | LA | C | IRT | - | En |
| [26] | 2018 | LA | C | IRT | - | En |
| [27] | 2017 | CKA | C | IRT | - | En |
| [29] | 2018 | CKA | D | Oth. | Knowledge graph | En |
| [30] | 2019 | CKA | D | Oth. | Images | En |
| [31] | 2019 | LA | C | Oth. | - | En |
| [44] | 2019 | LA | C | Oth. | - | Ru |
| [45] | 2018 | CKA | D | IRT | - | Ch |
| [48] | 2017 | LA | C | CTT | - | En |
| [47] | 2018 | LA | D | Oth. | - | En |
| [55] | 2019 | CKA | D | Oth. | Knowledge graph | En |
| [59] | 2019 | LA | C | CTT | - | En |
| [61] | 2019 | LA | D | IRT, Oth. | - | Ch |
| [67] | 2016 | LA | C | Oth. | Audio | En |
| [72] | 2017 | CKA | C | IRT | Bloom’s taxonomy | De |
| [74] | 2019 | LA | C | IRT | Predefined tables | En |
| [78] | 2016 | CKA | D | Oth. | FOL | En |
| [81] | 2019 | CKA | C | CTT | - | En |
| [88] | 2020 | LA | C | Oth. | - | En |
| [89] | 2017 | CKA | D | Oth. | Knowledge graph | En |
| [94] | 2016 | CKA | C | IRT | Images, tables | De |
| [97] | 2016 | LA | C | CTT, Oth. | JACET8000 | En |
| [99] | 2017 | LA | D | CTT | JACET8000 | En |
| [98] | 2020 | LA | D | CTT | JACET8000 | En |
| [102] | 2020 | CKA | C | CTT | Q-matrix, test logs | En |
| [104] | 2017 | LA | C | IRT | Brown corpus | En |
| [107] | 2015 | CKA | C | Oth. | Knowledge graph | En |
| [113] | 2020 | CKA | C | CTT | Response times | En |
| [114] | 2019 | CKA | C | CTT | - | En |
| [115] | 2020 | CKA | C | CTT | - | En |
| [116] | 2018 | LA | D | Oth. | Predefined tables | En |
| [119] | 2019 | CKA | C | CTT | Q-matrix, images | En |
| [122] | 2020 | CKA | D | CTT | - | En |

Another aspect to consider is the size of the experimental datasets (i.e., number of questions) used to evaluate model performance; indeed, it is very diverse across the different works. Figure 8(a) shows the distribution of papers with respect to dataset size, displaying the overall distribution and the distributions for LA and CKA. It is clearly visible that dataset sizes vary widely in these domains: Indeed, the average dataset size is about 16,000 across all papers, 28,700 considering the papers working on CKA, and only 3,975 considering the papers dealing with LA.

Fig. 8. Distribution of publications per year and dataset size, showing the overall distribution and separately for Language Assessment (LA) and Content Knowledge Assessment (CKA).

7.3 Learning Theories and Difficulty Format

The majority of papers—25—model QDET as a regression task, considering difficulty as a continuous value; only 13 papers consider discrete difficulties, thus modeling QDET as a classification task. Considering the learning theories chosen in each paper, IRT and CTT are by far the most common choices (12 and 14 papers, respectively), but some papers use other approaches: 7 papers work on manually crafted difficulty levels, and 6 papers use other definitions (e.g., CEFR levels, EQDelta).

7.4 Natural Language

Considering the natural language of the questions featuring in the task of QDET, there is an overwhelming majority (33) of models built for questions written in English. Other languages occurring in the works presented here are German (3), Chinese (2), French (1), and Russian (1). It is also interesting to observe that the papers dealing with LA cover a more diverse set of languages overall, while the papers on CKA have so far worked on English, German (2), and Chinese (1).

7.5 Publications per Year and Venue

Figure 8(b) shows the number of published papers per year, and we can see that in recent years there has been an increased interest in QDET, reflected in the growing number of publications. However, if we consider separately the number of publications for the LA domain and the CKA domain, then we can see that this increase has not been evenly paced. Indeed, while the number of works in LA seems to be fairly constant (except for a small spike in 2019), the number of published papers for CKA rapidly increased in the past two years. Most likely, this is due to the rapid improvements in NLP techniques, such as word embeddings and neural network language modelling, which are now being used more frequently for the task of QDET.

In terms of publication venues, the majority of papers have been published at conferences and workshops, and only 10 in journals. No single conference or workshop attracts a majority of papers: The venue where the largest number of papers (5) was published is BEA (the Workshop on Innovative Use of NLP for Building Educational Applications). Other venues where more than one of the featured papers have been published are ACL, AIED, CIKM, and LREC.

8 DISCUSSION AND CONCLUSIONS

In this survey, we have investigated the state-of-the-art in research on Question Difficulty Estimation from Text (QDET). We have observed that recent years have witnessed an increased research interest in this domain, which is at least partially due to the concurrent advancements in NLP. Indeed, the research on QDET is shifting from techniques grounded in linguistic and education theory towards approaches that are solely based on recent machine learning models. A very brief recap of all the models evaluated in this survey is provided in Tables 14 and 15.

Table 14. A Very Brief Recap of the Models that Have Been Proposed in Recent Years for QDET in the Language Assessment Domain

| Paper | Brief description of the approach to QDET |
|---|---|
| [6] | SVM that uses 70 features related to (i) the difficulty of the text passage, (ii) the difficulty of the target word, and (iii) test parameters. |
| [10] | Five features are computed from the question, the passage, and the relation between the two. They are normalized, averaged, and compared to a threshold. |
| [23] | Found that character length and corpus frequency significantly correlate with vocabulary difficulty. |
| [26] | Word2vec embeddings given as input to an SVM regressor. |
| [31] | Built for cloze items; Shannon’s entropy is used to assign a score to each gap based on the number of valid words that could fill the gap given the context (candidates obtained with a 5-gram language model); the score is considered as question difficulty. |
| [44] | Linear regression model that uses as features mean token length and mean sentence length. |
| [47] | Reading difficulty directly considered as an indication of question difficulty. |
| [48] | Words are converted to word embeddings with word2vec; sentences are then embedded with a sentence CNN; an attention mechanism is used to detect the relevant parts of the reading passage; last, the estimation is done with an FCNN. Built for reading comprehension questions. |
| [59] | SVM that uses 59 features (reduced from the 70 in Reference [6]). |
| [61] | Input words are converted to word embeddings with word2vec, the sequences of word embeddings are then embedded with an LSTM, and the final estimation is done with an FCNN. |
| [67] | Built for listening comprehension questions; uses 339 raw features obtained from the text (written and spoken) using TextEvaluator as input to a random forest. |
| [74] | Ridge regression, using 36 features from the gap and the context (works on Cued Gap-Filling Items). |
| [88] | Weighted softmax model that uses as features: word length, log-likelihood from a character-level language model, and Fisher score. |
| [97] | 10 features from target word, reading passage, correct answer, and distractors. |
| [99] | Features are reading passage difficulty, similarity between correct answer and distractors, and distractor word difficulty. Features can be low or high, and the number of “low” features is the difficulty. |
| [98] | Features are target word difficulty, similarity between correct answer and distractors, and distractor word difficulty. Features can be low or high, and the number of “low” features is the difficulty. |
| [104] | Linear regression model that uses as features 25 linguistic variables at passage and item level. |
| [116] | SVM that uses as features: word length, word frequency, utilization on the web, age of acquisition, concreteness rating, number of POS tags, most frequent POS tag, word2vec embeddings, number of double consonants, number of vowels, presence of shorter homophones. |

Note: For a detailed description of all the models, we refer to Section 5.

Table 15. A Very Brief Recap of the Models that Have Been Proposed in Recent Years for QDET in the Content Knowledge Assessment Domain

| Paper | Brief description of the approach to QDET |
|---|---|
| [8] | TF-IDF, linguistic features, and readability measures as input to a random forest. |
| [9] | TF-IDF features as input to a random forest. |
| [21] | End-to-end neural network made of an LSTM component and an attention-based component. |
| [27] | Features from Coh-Metrix (grouped in narrativity, syntactic simplicity, word concreteness, referential cohesion, and deep cohesion) as input to a linear regression model. |
| [29] | Difficulty defined as the inverse of the average popularity of the entities in the question, which is computed using a (required) additional knowledge graph. Text is used only for Named Entity Recognition. |
| [30] | Works on questions with images. It uses (i) ResNet to extract image representations, (ii) BERT for embedding textual content, (iii) a capsule neural network to obtain a fixed-length array that represents each question, and (iv) a Bayesian inference-based softmax regression classifier to perform the numerical estimation. |
| [45] | SVM that uses as features the cosine similarity between the word2vec embeddings of the stem, the correct choice, and the distractors. |
| [55] | Difficulty obtained from the confidence and selectivity of the question, which are computed from a (required) additional knowledge graph. Text is used only for Named Entity Recognition. |
| [72] | Difficulty is defined as the variance in the text of students’ answers. |
| [78] | Manually curated table that maps from specific feature values to question difficulty, using features from text and FOL formulas. |
| [81] | Two neural networks estimate two components of question difficulty (recall difficulty and confusion difficulty) that are averaged to obtain the difficulty. |
| [89] | Logistic regression model that uses 15 features related to (i) entity salience (a proxy of entity popularity) and (ii) coherence of entity pairs (captures their tendency to appear in the same context). |
| [94] | Studies how the presence of images, tables, formulas, and some textual features (text length, presence of specialist terms and abstract concepts) affect item difficulty. |
| [102] | Pre-trained BERT to embed questions, and TextCNN to perform QDE. |
| [107] | Difficulty is defined as the similarity between the correct choice and the distractors. |
| [113] | Uses pre-trained ELMo embeddings, followed by an encoding layer made of a BiLSTM and a dense layer to convert the feature vectors to the target values. It is trained first on response time prediction and subsequently on QDET. |
| [114] | Random Forest that uses as features: word embeddings (word2vec, ELMo), linguistic features, Information Retrieval-based features. |
| [115] | Same as Reference [114]. |
| [119] | End-to-end neural network built for heterogeneous questions. It is made of (i) an embedding layer for heterogeneous content (word2vec for texts, convolutional layers for images, fully connected layers for metadata), (ii) BiLSTM, (iii) self-attention, and (iv) max pooling to obtain pre-trained question representations. Fine-tuning on QDE with a fully connected neural network. |
| [122] | Multi-task BERT. |

Note: For a detailed description of all the models, we refer to Section 6.

Recent work can be separated into two broad categories: (i) Language Assessment (LA) and (ii) Content Knowledge Assessment (CKA), and the task of QDET is often approached with different techniques in the two domains. While research on LA still relies heavily on theoretically grounded techniques (possibly supported by recent machine learning models), work on CKA has shifted towards using almost exclusively learned features and machine learning models. This difference is not only due to the long history of research in the LA domain, but also to the fact that the “source” of question difficulty is diverse in the two domains. While in LA the difficulty comes from the language itself and the wording, in CKA the majority of the difficulty is related to the topics that are assessed by each question, and only influenced by the wording. Therefore, accurate models in CKA need to be able to model the semantics of the question to perform QDET, which is not as important in LA. Although recent neural models are not a panacea for all the problems related to QDET, they can likely bring advantages that have yet to be fully explored, especially in LA.

In this survey, we have not explored approaches to QDE that infer question difficulty based on large datasets of interactions between students and items. For instance, by learning from dichotomous pass/fail outcomes we can learn high-dimensional abstract item representations—so-called “skills embeddings” that indicate how items relate to and depend on each other [70]. Or similarly, we can learn how tasks prompt different types of grammatical error in a language learning context, and thus estimate question difficulty based on large datasets of learner essays [120]. Such models can be used to indicate which items students should be shown next, in a recommender system type setting. However, these approaches rely on the availability of plentiful student data, which is not available for new items. A solution could be for these data-driven models to be combined with features extracted from the question text in hybrid approaches that allow continuous QDET updating as student data comes in for new items.

In this survey, we have highlighted one important issue in this domain: the lack of a well-established framework for the evaluation of the proposed architectures. Exam questions are generally seen as confidential data that must be kept out of reach of outsiders, and therefore it is not easy to find publicly available datasets for this task. For similar reasons, there is also a lack of publicly available code. Both these aspects make the evaluation and the comparison of competing algorithms very difficult, and thus it is not immediately clear which approaches perform best in which scenarios, nor what the strengths and the weaknesses of each one are. We understand that it is generally unfeasible to publicly share exam content, but we believe that the community should work towards publicly sharing the code for re-implementing the proposed models on available datasets such as Eedi’s contribution to the NeurIPS 2020 Education Challenge [109], the Duolingo SLAM dataset [87], or the Cambridge English Readability Dataset [112].

Even though there are differences between the models proposed for each category of our taxonomy, there are some observations that hold true across the spectrum of possible domains and might be helpful as guidelines for future research on QDET. First, the conclusion from several papers is that question difficulty is influenced by all the components of the question and, therefore, better performance is possible when all this information is leveraged by the model. For instance, in Multiple Choice Questions both the correct choice and the distractors influence question difficulty, and in reading comprehension questions the reading passage (and its similarity with the question) has a crucial role in determining question difficulty.

Another observation is that additional datasets can be very useful for improving the domain knowledge of the QDE model and thus its accuracy. While this is especially true for CKA—as the model needs to learn the question semantics—and can be done with textual content related to the same topics that are assessed by the questions (e.g., lecture transcripts, books), several works observed that it is also relevant for LA.

As for the architectures, most research seems to agree that in CKA the models that lead to the most accurate estimations are end-to-end neural networks (especially when using an attention mechanism, as in Transformers), but they can be limited by the size of the experimental dataset; indeed, they perform best when trained on large amounts of data. A promising alternative, which can be trained on less data, are keyword-based approaches such as the ones based on TF-IDF. The observations for LA, however, are different: Indeed, even though some papers successfully leveraged end-to-end neural networks (especially for reading comprehension questions), others obtain good results using simpler models based on word complexity measures, readability indexes, and other linguistic features. In this sense, we argue that in such cases the computational cost of large neural models is probably not justified by the limited improvement (if any) in accuracy.

Footnotes

  1. For clarity, we remark here that this definition of “comprehension questions” and “knowledge questions” is different from that given in Bloom’s taxonomy [11]. Indeed, Bloom’s taxonomy delineates a hierarchy of cognitive-learning levels, ranging from the knowledge of specific facts to more advanced levels of synthesis, while here we categorize the questions depending on their format.
  2. Considering the additional information that is available (i.e., text to read and audio to listen to), these types of questions might seem to belong to the category of comprehension questions. However, they do not require the students to infer the answer to a specific question from the text/audio, but only to perform a transformation from written to spoken form or vice versa.
  3. The authors refer to these as “layers” instead of “components”; we change notation to clarify that these components can themselves be composed of several hidden layers.
  4. They experimented with least squares linear regression, LASSO regression, decision tree regression, elastic net, k-neighbors regression, stochastic gradient descent regression, linear and non-linear support vector regression.
  5. Type-token ratio (TTR) is a measure of vocabulary variation within a text, and it is shown to be a helpful measure of lexical variety.
  6. https://www.statmt.org/lm-benchmark/
  7. https://www.nlm.nih.gov/bsd/medline.html

REFERENCES

  [1] Attali Yigal, Saldivia Luis, Jackson Carol, Schuppan Fred, and Wanamaker Wilbur. 2014. Estimating item difficulty with comparative judgments. ETS Res. Rep. Series 2014, 2 (2014), 1–8.
  [2] Auer Sören, Bizer Christian, Kobilarov Georgi, Lehmann Jens, Cyganiak Richard, and Ives Zachary. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web. Springer, 722–735.
  [3] Bachman Lyle F. et al. 1990. Fundamental Considerations in Language Testing. Oxford University Press.
  [4] Beglar David and Hunt Alan. 1999. Revising and validating the 2000 word level and university word level vocabulary tests. Lang. Test. 16, 2 (1999), 131–162.
  [5] Beinborn Lisa, Zesch Torsten, and Gurevych Iryna. 2014. Predicting the difficulty of language proficiency tests. Trans. Assoc. Computat. Ling. 2 (2014), 517–530.
  [6] Beinborn Lisa, Zesch Torsten, and Gurevych Iryna. 2015. Candidate evaluation strategies for improved difficulty prediction of language tests. In Proceedings of the 10th Workshop on Innovative Use of NLP for Building Educational Applications. 1–11.
  [7] Benedetto Luca, Aradelli Giovanni, Cremonesi Paolo, Cappelli Andrea, Giussani Andrea, and Turrin Roberto. 2021. On the application of Transformers for estimating the difficulty of multiple-choice questions from text. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications. 147–157.
  [8] Benedetto Luca, Cappelli Andrea, Turrin Roberto, and Cremonesi Paolo. 2020. Introducing a framework to assess newly created questions with natural language processing. In Proceedings of the International Conference on Artificial Intelligence in Education. Springer, 43–54.
  [9] Benedetto Luca, Cappelli Andrea, Turrin Roberto, and Cremonesi Paolo. 2020. R2DE: A NLP approach to estimating IRT parameters of newly generated questions. In Proceedings of the 10th International Conference on Learning Analytics & Knowledge. 412–421.
  [10] Bi Sheng, Cheng Xiya, Li Yuan-Fang, Qu Lizhen, Shen Shirong, Qi Guilin, Pan Lu, and Jiang Yinlin. 2021. Simple or complex? Complexity-controllable question generation with soft templates and deep mixture of experts model. In Findings of the Association for Computational Linguistics: EMNLP 2021. 4645–4654.
  [11] Bloom Benjamin Samuel. 1956. Taxonomy of educational objectives: The classification of educational goals. Cogn. Domain (1956).
  [12] Boldt Robert F. and Freedle Roy. 1996. Using a neural net to predict item difficulty. ETS Res. Rep. Series 1996, 2 (1996), i–19.
  [13] Brysbaert Marc and New Boris. 2009. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behav. Res. Meth. 41, 4 (2009), 977–990.
  [14] Brysbaert Marc, Warriner Amy Beth, and Kuperman Victor. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Meth. 46, 3 (2014), 904–911.
  [15] Ch Dhawaleswar Rao and Saha Sujan Kumar. 2018. Automatic multiple choice question generation from text: A survey. IEEE Trans. Learn. Technol. 13, 1 (2018), 14–25.
  [16] Chall Jeanne Sternlicht and Dale Edgar. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.
  [17] Chelba Ciprian, Mikolov Tomas, Schuster Mike, Ge Qi, Brants Thorsten, Koehn Phillipp, and Robinson Tony. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.
  [18] Chen Chih-Ming, Lee Hahn-Ming, and Chen Ya-Hui. 2005. Personalized e-learning system using item response theory. Comput. Educ. 44, 3 (2005), 237–255.
  [19] Chen Keh-Jiann, Huang Chu-Ren, Chang Li-Ping, and Hsu Hui-Li. 1996. Sinica corpus: Design methodology for balanced corpora. In Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation. 167–176.
  19. [19] Chen Keh-Jiann, Huang Chu-Ren, Chang Li-Ping, and Hsu Hui-Li. 1996. Sinica corpus: Design methodology for balanced corpora. In Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation. 167176.Google ScholarGoogle Scholar
  20. [20] Chen Xiaobin and Meurers Detmar. 2016. Characterizing text difficulty with word frequencies. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications. 8494.Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Cheng Song, Liu Qi, Chen Enhong, Huang Zai, Huang Zhenya, Chen Yiying, Ma Haiping, and Hu Guoping. 2019. DIRT: Deep learning enhanced item response theory for cognitive diagnosis. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 23972400.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Coleman Edmund B.. 1965. On understanding prose: Some determiners of its complexity. NSF Final Report GB-2604. Washington, DC: National Science Foundation (1965).Google ScholarGoogle Scholar
  [23] Culligan Brent. 2015. A comparison of three test formats to assess word difficulty. Lang. Test. 32, 4 (2015), 503–520.
  [24] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
  [25] Eckes Thomas and Grotjahn Rüdiger. 2006. A closer look at the construct validity of C-tests. Lang. Test. 23, 3 (2006), 290–325.
  [26] Ehara Yo. 2018. Building an English vocabulary knowledge dataset of Japanese English-as-a-second-language learners using crowdsourcing. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18).
  [27] El Masri Yasmine H., Ferrara Steve, Foltz Peter W., and Baird Jo-Anne. 2017. Predicting item difficulty of science national curriculum tests: The case of key stage 2 assessments. Curric. J. 28, 1 (2017), 59–82.
  [28] Elkan Charles. 2005. Deriving TF-IDF as a Fisher kernel. In Proceedings of the International Symposium on String Processing and Information Retrieval. Springer, 295–300.
  [29] Faizan Ainuddin and Lohmann Steffen. 2018. Automatic generation of multiple choice questions from slide content using linked data. In Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. 1–8.
  [30] Fang Jiansheng, Zhao Wei, and Jia Dongya. 2019. Exercise difficulty prediction in online education systems. In Proceedings of the International Conference on Data Mining Workshops (ICDMW). IEEE, 311–317.
  [31] Felice Mariano and Buttery Paula. 2019. Entropy as a proxy for gap complexity in open cloze tests. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’19). 323–327.
  [32] Flesch Rudolph. 1948. A new readability yardstick. J. Appl. Psychol. 32, 3 (1948), 221.
  [33] Gao Yifan, Bing Lidong, Chen Wang, Lyu Michael, and King Irwin. 2019. Difficulty controllable generation of reading comprehension questions. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 4968–4974.
  [34] Garg Siddhant and Moschitti Alessandro. 2021. Will this question be answered? Question filtering via answer model distillation for efficient question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 7329–7346.
  [35] Graesser Arthur C., McNamara Danielle S., Louwerse Max M., and Cai Zhiqiang. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behav. Res. Meth., Instrum. Comput. 36, 2 (2004), 193–202.
  [36] Graff David, Kong Junbo, Chen Ke, and Maeda Kazuaki. 2003. English Gigaword. Ling. Data Consort., Phil. 4, 1 (2003), 34.
  [37] Gunning Robert et al. 1952. Technique of clear writing.
  [38] Hambleton Ronald K. and Jones Russell W. 1993. Comparison of classical test theory and item response theory and their applications to test development. Educ. Meas.: Iss. Pract. 12, 3 (1993), 38–47.
  [39] Hambleton Ronald K., Swaminathan Hariharan, and Rogers H. Jane. 1991. Fundamentals of Item Response Theory. Sage.
  [40] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  [41] Heafield Kenneth. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the 6th Workshop on Statistical Machine Translation. 187–197.
  [42] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
  [43] Hoffmann Sebastian, Evert Stefan, Smith Nicholas, Lee David, Berglund-Prytz Ylva et al. 2008. Corpus Linguistics with BNCweb: A Practical Guide, Vol. 6. Peter Lang.
  [44] Hou Jue, Koppatz Maximilian, Quecedo José María Hoya, Stoyanova Nataliya, and Yangarber Roman. 2019. Modeling language learning using specialized Elo rating. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications. 494–506.
  [45] Hsu Fu-Yuan, Lee Hahn-Ming, Chang Tao-Hsing, and Sung Yao-Ting. 2018. Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques. Inf. Process. Manag. 54, 6 (2018), 969–984.
  [46] Huang Yi-Ting, Chang Hsiao-Pei, Sun Yeali, and Chen Meng Chang. 2011. A robust estimation scheme of reading difficulty for second language learners. In Proceedings of the IEEE 11th International Conference on Advanced Learning Technologies. IEEE, 58–62.
  [47] Huang Yi-Ting, Chen Meng Chang, and Sun Yeali S. 2018. Development and evaluation of a personalized computer-aided question generation for English learners to improve proficiency and correct mistakes. arXiv preprint arXiv:1808.09732 (2018).
  [48] Huang Zhenya, Liu Qi, Chen Enhong, Zhao Hongke, Gao Mingyong, Wei Si, Su Yu, and Hu Guoping. 2017. Question difficulty prediction for READING problems in standard tests. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
  [49] Huang Zhenya, Liu Qi, Zhai Chengxiang, Yin Yu, Chen Enhong, Gao Weibo, and Hu Guoping. 2019. Exploring multi-objective exercise recommendations in online education systems. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1261–1270.
  [50] Huang Zhenya, Yin Yu, Chen Enhong, Xiong Hui, Su Yu, Hu Guoping et al. 2019. EKT: Exercise-aware knowledge tracing for student performance prediction. IEEE Trans. Knowl. Data Eng. (2019).
  [51] Iqbal Syeda Asima and Komal Syeda Anila. 2017. Analyzing the effectiveness of vocabulary knowledge scale on learning and enhancing vocabulary through extensive reading. Engl. Lang. Teach. 10, 9 (2017), 36–48.
  [52] Jones Karen Sparck. 1972. A statistical interpretation of term specificity and its application in retrieval. J. Document. (1972).
  [53] Kim Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–1751.
  [54] Kincaid J. Peter, Fishburne Robert P. Jr., Rogers Richard L., and Chissom Brad S. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical Report. Naval Technical Training Command Millington TN Research Branch.
  [55] Kumar Vishwajeet, Hua Yuncheng, Ramakrishnan Ganesh, Qi Guilin, Gao Lianli, and Li Yuan-Fang. 2019. Difficulty-controllable multi-hop question generation from knowledge graphs. In Proceedings of the International Semantic Web Conference. Springer, 382–398.
  [56] Kuperman Victor, Stadthagen-Gonzalez Hans, and Brysbaert Marc. 2012. Age-of-acquisition ratings for 30,000 English words. Behav. Res. Meth. 44, 4 (2012), 978–990.
  [57] Kurdi Ghader, Leo Jared, Parsia Bijan, Sattler Uli, and Al-Emari Salam. 2020. A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Educ. 30, 1 (2020), 121–204.
  [58] Lane Suzanne, Raymond Mark R., and Haladyna Thomas M. 2015. Handbook of Test Development. Routledge.
  [59] Lee Ji-Ung, Schwan Erik, and Meyer Christian M. 2019. Manipulating the difficulty of C-tests. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 360–370.
  [60] Leong S. C. 2006. On varying the difficulty of test items. In Proceedings of the 32nd Annual Conference of the International Association for Educational Assessment. 21–26.
  [61] Lin Li-Huai, Chang Tao-Hsing, and Hsu Fu-Yuan. 2019. Automated prediction of item difficulty in reading comprehension using long short-term memory. In Proceedings of the International Conference on Asian Language Processing (IALP’19). IEEE, 132–135.
  [62] Linden Wim J. van der and Glas Cees A. W. 2000. Computerized Adaptive Testing: Theory and Practice. Springer.
  [63] Lison Pierre and Tiedemann Jörg. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 923–929.
  [64] Liu Jing, Wang Quan, Lin Chin-Yew, and Hon Hsiao-Wuen. 2013. Question difficulty estimation in community question answering services. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 85–90.
  [65] Liu Qi, Huang Zai, Huang Zhenya, Liu Chuanren, Chen Enhong, Su Yu, and Hu Guoping. 2018. Finding similar exercises in online education systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1821–1830.
  [66] Loper Edward and Bird Steven. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. 63–70.
  [67] Loukina Anastassia, Yoon Su-Youn, Sakano Jennifer, Wei Youhua, and Sheehan Kathy. 2016. Textual complexity as a predictor of difficulty of listening items in language proficiency tests. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. 3245–3253.
  [68] McLaughlin G. Harry. 1969. SMOG grading: A new readability formula. J. Read. 12, 8 (1969), 639–646.
  [69] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg, and Dean Jeffrey. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems. 3111–3119.
  [70] Moore Russell, Caines Andrew, Elliott Mark, Zaidi Ahmed, Rice Andrew, and Buttery Paula. 2019. Skills embeddings: A neural approach to multicomponent representations of students and tasks. In Proceedings of the 12th International Conference on Educational Data Mining (EDM).
  [71] Nation ISP and Teaching P. 1983. Learning vocabulary. New Zeal. Lang. Teach. 9, 1 (1983), 10–11.
  [72] Padó Ulrike. 2017. Question difficulty – How to estimate without norming, how to use for automated grading. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 1–10.
  [73] Pado Ulrike and Kiefer Cornelia. 2015. Short answer grading: When sorting helps and when it doesn’t. In Proceedings of the 4th Workshop on NLP for Computer-assisted Language Learning. 42–50.
  [74] Pandarova Irina, Schmidt Torben, Hartig Johannes, Boubekki Ahcène, Jones Roger Dale, and Brefeld Ulf. 2019. Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring. Int. J. Artif. Intell. Educ. 29, 3 (2019), 342–367.
  [75] Pandey Shalini and Srivastava Jaideep. 2020. RKT: Relation-aware self-attention for knowledge tracing. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1205–1214.
  [76] Pearson Karl. 1901. LIII. On lines and planes of closest fit to systems of points in space. Lond., Edinb., Dubl. Philos. Mag. J. Sci. 2, 11 (1901), 559–572.
  [77] Pennington Jeffrey, Socher Richard, and Manning Christopher D. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.
  [78] Perikos Isidoros, Grivokostopoulou Foteini, Kovas Konstantinos, and Hatzilygeroudis Ioannis. 2016. Automatic estimation of exercises’ difficulty levels in a tutoring system for teaching the conversion of natural language into first-order logic. Exp. Syst.: J. Knowl. Eng. 33, 6 (2016), 569–580.
  [79] Perkins Kyle, Gupta Lalit, and Tammana Ravi. 1995. Predicting item difficulty in a reading comprehension test with an artificial neural network. Lang. Test. 12, 1 (1995), 34–53.
  [80] Peters Matthew, Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2227–2237.
  [81] Qiu Zhaopeng, Wu Xian, and Fan Wei. 2019. Question difficulty prediction for multiple choice problems in medical exams. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 139–148.
  [82] Rasch Georg. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research.
  [83] Sabour Sara, Frosst Nicholas, and Hinton Geoffrey E. 2017. Dynamic routing between capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 3859–3869.
  [84] Sanh Victor, Debut Lysandre, Chaumond Julien, and Wolf Thomas. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  [85] Schmitt Norbert, Schmitt Diane, and Clapham Caroline. 2001. Developing and exploring the behaviour of two new versions of the vocabulary levels test. Lang. Test. 18, 1 (2001), 55–88.
  [86] Senter R. J. and Smith Edgar A. 1967. Automated Readability Index. Technical Report. Cincinnati University, OH.
  [87] Settles Burr, Brust Chris, Gustafson Erin, Hagiwara Masato, and Madnani Nitin. 2018. Second language acquisition modeling. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications. 56–65.
  [88] Settles Burr, LaFlair Geoffrey T., and Hagiwara Masato. 2020. Machine learning–driven language assessment. Trans. Assoc. Computat. Ling. 8 (2020), 247–263.
  [89] Seyler Dominic, Yahya Mohamed, and Berberich Klaus. 2017. Knowledge questions from knowledge graphs. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. 11–18.
  [90] Shannon Claude Elwood. 2001. A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5, 1 (2001), 3–55.
  [91] Sheehan Kathleen M., Flor Michael, and Napolitano Diane. 2013. A two-stage approach for generating unbiased estimates of text complexity. In Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility. 49–58.
  [92] Sheehan Kathleen M., Kostin Irene, Napolitano Diane, and Flor Michael. 2014. The TextEvaluator tool: Helping teachers and test developers select texts for use in instruction and assessment. Element. School J. 115, 2 (2014), 184–209.
  [93] Spolsky Bernard. 1969. Reduced redundancy as a language testing tool.
  [94] Stiller Jurik, Hartmann Stefan, Mathesius Sabrina, Straube Philipp, Tiemann Rüdiger, Nordmeier Volkhard, Krüger Dirk, and Upmeier zu Belzen Annette. 2016. Assessing scientific reasoning: A comprehensive evaluation of item features that affect item difficulty. Assess. Eval. High. Educ. 41, 5 (2016), 721–732.
  [95] Su Yu, Liu Qingwen, Liu Qi, Huang Zhenya, Yin Yu, Chen Enhong, Ding Chris, Wei Si, and Hu Guoping. 2018. Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the AAAI Conference on Artificial Intelligence.
  [96] Suchanek Fabian M., Hoffart Johannes, Kuzey Erdal, and Lewis-Kelham Edwin. 2013. YAGO2s: Modular high-quality information extraction with an application to flight planning. Datenbanksysteme für Business, Technologie und Web (BTW) 2048 (2013).
  [97] Susanti Yuni, Nishikawa Hitoshi, Tokunaga Takenobu, and Obari Hiroyuki. 2016. Item difficulty analysis of English vocabulary questions. In Proceedings of the 8th International Conference on Computer Supported Education. 267–274.
  [98] Susanti Yuni, Tokunaga Takenobu, and Nishikawa Hitoshi. 2020. Integrating automatic question generation with computerised adaptive test. Res. Pract. Technol. Enhanc. Learn. 15 (2020), 1–22.
  [99] Susanti Yuni, Tokunaga Takenobu, Nishikawa Hitoshi, and Obari Hiroyuki. 2017. Controlling item difficulty for automatic vocabulary question generation. Res. Pract. Technol. Enhanc. Learn. 12, 1 (2017), 1–16.
  [100] Talmor Alon and Berant Jonathan. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 641–651.
  [101] Thalhammer Andreas and Rettinger Achim. 2016. PageRank on Wikipedia: Towards general importance scores for entities. In Proceedings of the European Semantic Web Conference. Springer, 227–240.
  [102] Tong Hanshuang, Zhou Yun, and Wang Zhen. 2020. Exercise hierarchical feature enhanced knowledge tracing. In Proceedings of the International Conference on Artificial Intelligence in Education. Springer, 324–328.
  [103] Tong Hanshuang, Zhou Yun, and Wang Zhen. 2020. HGKT: Introducing problem schema with hierarchical exercise graph for knowledge tracing. arXiv preprint arXiv:2006.16915 (2020).
  [104] Trace Jonathan, Brown James Dean, Janssen Gerriet, and Kozhevnikova Liudmila. 2017. Determining cloze item difficulty from item and passage characteristics across different learner backgrounds. Lang. Test. 34, 2 (2017), 151–174.
  [105] Uemura Toshihiko and Ishikawa Shinichiro. 2004. JACET 8000 and Asia TEFL vocabulary initiative. J. Asia TEFL 1, 1 (2004), 333–347.
  [106] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
  [107] Vinu Ellampallil Venugopal et al. 2015. A novel approach to generate MCQs from domain ontology: Considering DL semantics and open-world assumption. J. Web Semant. 34 (2015), 40–54.
  [108] Wang Tzu-Hua. 2014. Developing an assessment-centered e-Learning system for improving student learning effectiveness. Comput. Educ. 73 (2014), 189–203.
  [109] Wang Zichao, Lamb Angus, Saveliev Evgeny, Cameron Pashmina, Zaykov Yordan, Baraniuk Richard G., Barton Craig, Jones Simon Peyton et al. 2020. Diagnostic questions: The NeurIPS 2020 education challenge. arXiv preprint arXiv:2007.12061 (2020).
  [110] Way Walter D. 1998. Protecting the integrity of computerized testing item pools. Educ. Meas.: Iss. Pract. 17, 4 (1998), 17–27.
  [111] Wise Michael J. 1996. YAP3: Improved detection of similarities in computer program and other texts. In Proceedings of the 27th SIGCSE Technical Symposium on Computer Science Education. 130–134.
  [112] Xia Menglin, Kochmar Ekaterina, and Briscoe Ted. 2016. Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, 12–22.
  [113] Xue Kang, Yaneva Victoria, Runyon Christopher, and Baldwin Peter. 2020. Predicting the difficulty and response time of multiple choice questions using transfer learning. In Proceedings of the 15th Workshop on Innovative Use of NLP for Building Educational Applications. 193–197.
  [114] Yaneva Victoria, Baldwin Peter, Mee Janet et al. 2019. Predicting the difficulty of multiple choice questions in a high-stakes medical exam. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications. 11–20.
  [115] Yaneva Victoria, Baldwin Peter, Mee Janet et al. 2020. Predicting item survival for multiple choice questions in a high-stakes medical exam. In Proceedings of the 12th Language Resources and Evaluation Conference. 6812–6818.
  [116] Yang Hua and Eum Suyong. 2018. Feature analysis on English word difficulty by Gaussian Mixture Model. In Proceedings of the International Conference on Information and Communication Technology Convergence (ICTC’18). IEEE, 191–194.
  [117] Yih Wen-tau, Richardson Matthew, Meek Christopher, Chang Ming-Wei, and Suh Jina. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 201–206.
  [118] Yin Yu, Huang Zhenya, Chen Enhong, Liu Qi, Zhang Fuzheng, Xie Xing, and Hu Guoping. 2018. Transcribing content from structural images with spotlight mechanism. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2643–2652.
  [119] Yin Yu, Liu Qi, Huang Zhenya, Chen Enhong, Tong Wei, Wang Shijin, and Su Yu. 2019. QuesNet: A unified representation for heterogeneous test questions. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1328–1336.
  [120] Zaidi Ahmed H., Caines Andrew, Davis Christopher, Moore Russell, Buttery Paula, and Rice Andrew. 2019. Accurate modelling of language learning tasks and students using representations of grammatical proficiency. In Proceedings of the 12th International Conference on Educational Data Mining (EDM’19).
  [121] Zhou Mantong, Huang Minlie, and Zhu Xiaoyan. 2018. An interpretable reasoning network for multi-relation question answering. In Proceedings of the 27th International Conference on Computational Linguistics. 2010–2022.
  [122] Zhou Ya and Tao Can. 2020. Multi-task BERT for problem difficulty prediction. In Proceedings of the International Conference on Communications, Information System and Computer Engineering (CISCE’20). IEEE, 213–216.
