Quantitatively ranking incorrect responses to multiple-choice questions using item response theory

Research-based assessment instruments (RBAIs) are ubiquitous throughout both physics instruction and physics education research. The vast majority of analyses involving student responses to RBAI questions have focused on whether or not a student selects correct answers, using correctness to measure growth. This approach often undervalues the rich information that may be obtained by examining students' particular choices of incorrect answers. In the present study, we aim to reveal some of this valuable information by quantitatively determining the relative correctness of various incorrect responses. To accomplish this, we propose an assumption that allows us to define relative correctness: students who have a high understanding of Newtonian physics are likely to answer more questions correctly and are also more likely to choose better incorrect responses than students who have a low understanding. Analyses using item response theory align with this assumption, and Bock's nominal response model allows us to uniquely rank each incorrect response. We present results from over 7,000 students' responses to the Force and Motion Conceptual Evaluation (FMCE).


I. INTRODUCTION
Many instructional and research questions over the past three decades have been answered by examining student responses to multiple-choice research-based assessment instruments (RBAIs) [1, 6]. A common factor throughout most of these analyses is that students' responses are typically scored as being correct or incorrect; very little attention has been paid to which incorrect answers students choose. This dichotomous scoring scheme is very beneficial for simplifying student performance on an RBAI, or growth in learning, to a single number that may be compared between students or across populations.
The simplicity of this analysis and the ability for instructors and researchers to compare their results with other data sets has contributed to the proliferation of RBAIs, to the benefit of the physics education research (PER) community; however, the dichotomous scoring scheme implicitly ignores any information about students' choices that are not correct. All incorrect answers are treated equally, regardless of how similar or different they may be to the correct answer.
RBAIs are so powerful because their questions help to elicit students' core beliefs about how the world works in ways that mathematical or problem-solving questions often do not.
Many of the incorrect "distractor" response choices correspond with deeply held intuitive understandings that fit well with everyday experiences (and correspond with historical models of motion) but conflict with the principles of Newtonian physics [1]. The ability to deeply probe students' conceptual understanding of physics and represent this understanding with a single numerical value is very powerful. The authors of the Force and Motion Conceptual Evaluation (FMCE), in fact, argue against using a single numerical score to represent student understanding [7], instead favoring the examination of student performance on individual or small groups of questions [4], but the common practice persists. Moreover, the related practice of reporting normalized gain as a measure of student learning has been shown to be biased against students with little prior exposure to formal physics instruction [8].
Other analyses of student responses to RBAI questions examine specific choices that students make and relate these choices to various mental models [9], misconceptions [3], views [10], or pieces of knowledge [11,12] that students may have or use when answering particular questions. These analyses provide a lot of rich information about students' ideas, but the processes of conducting these analyses are often quite time intensive, and the presentation and visualization of the results can be conceptually dense and difficult to interpret [13,14].
As such, these analyses are not nearly as common as reporting a single numeric score.
Our ultimate goal is to define a single numeric score that represents student knowledge or understanding as measured by a RBAI by incorporating both correct responses and the good ideas that may be expressed in some incorrect responses. The first part of defining such a metric is to determine whether or not some incorrect answers may be considered better than others, where "better" means closer to correct or indicating a higher level of understanding.
Considering one incorrect answer to be better than another can be a tricky business, and we want to make sure that we are not introducing personal bias into our definitions. As such, we carefully articulate the assumption for defining what makes one response better than another, and we choose an analysis method that corresponds to that assumption to quantitatively rank incorrect responses based on students' response patterns.
Assumption: Students who choose correct responses on most questions are more likely to choose better incorrect answers than students who choose few correct responses.
This assumption is based on the premise that students who understand more about Newtonian physics are more likely to choose better incorrect answers than students who understand less physics, and these students are also more likely to choose a greater number of correct responses. This assumption is consistent with previous work that has used item response curves (IRCs) to examine and rank incorrect responses on both the FCI and the FMCE [15][16][17][18]. We expand on this prior work by using a nested-logit item response theory (IRT) model to simultaneously estimate students' overall understanding of Newtonian mechanics (the IRT latent trait, or person parameter) and determine how closely each response choice correlates with a high level of understanding using the estimated parameters of the model [19][20][21][22][23][24]. Based on this assumption, we would claim, for example, that a student who only incorrectly answers one question is more likely to choose a response that is almost correct than a student who answers 20 questions incorrectly.
To illustrate the applicability of our assumption, we analyzed more than 7,000 students' matched pre-/post-test responses to the FMCE to demonstrate how quantitative analyses can provide information about which response choices may be better than others. We present a ranking of incorrect responses for all FMCE questions as well as the parameter values used to make these determinations.

II. DATA SOURCES AND PREPARATION
Our data come from two primary sources:
• sets of student responses to the FMCE provided to one author (TIS) by colleagues from four different colleges or universities, Schools 1-4 [25], as part of current and previous research projects (N = 952), and
• student responses uploaded to PhysPort's Data Explorer (N = 6,336) [26].
Some information is known about the instructional settings at Schools 1-4 (all of which used research-based instructional materials of some sort), but this information is not available for the PhysPort data. For the purposes of the current analysis, we combine all data into one set of N = 7,288 students. For this analysis we are not interested in how instructional factors impact student learning, nor in whether student responses differ before and after instruction. As such, we have combined all pretest and posttest responses into a data set of N = 14,576 response sets.
To prepare the data for analysis, we omit any responses that are inappropriate for a given question (e.g., a given response of E on question 45, which only includes options A, B, C, and D). We also omit response J (None of these answers is correct) from interpretations of our analyses because it does not represent a well-defined indication of what each student would consider correct: two students who choose answer A agree on what they consider to be correct, but two students who choose answer J may have very different ideas of what would be a correct answer, so we cannot claim similarities between the responses of students who choose J. We also removed response sets with three or more blank or unscorable responses. This gave us a usable data set of N = 12,388 response sets.
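The filtering described above can be sketched as follows. This is an illustrative Python sketch, not the authors' actual pipeline (the study's data were prepared for the mirt package in R), and the data structures and names are assumptions.

```python
def usable(response_set, valid_choices, max_bad=2):
    """Keep a response set only if it has at most `max_bad` blank or
    unscorable responses (the paper drops sets with three or more)."""
    bad = sum(1 for response, valid in zip(response_set, valid_choices)
              if response not in valid)
    return bad <= max_bad

# Toy three-question instrument; the third question only offers A-D
# (analogous to FMCE question 45).
valid_choices = [set("ABCDEFG"), set("ABCDEFG"), set("ABCD")]
keep = usable(["A", "B", "C"], valid_choices)  # all responses valid
drop = usable(["", "", "E"], valid_choices)    # three blank/unscorable responses
```

Blank responses and out-of-range letters are treated identically here: both count against the `max_bad` threshold.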
The structure of the FMCE makes it an interesting focus for this work. Unlike many other RBAIs, the FMCE contains several questions for each physical scenario presented (e.g., a toy car moving horizontally), and all questions in each set have the same set of response choices. This is particularly interesting because a response choice that corresponds with the most common intuitive answer to one question, may not relate to any documented reasoning for another question.

III. THE 2PL-NRM NESTED LOGIT MODEL
Item Response Theory (IRT) uses students' responses to multiple-choice questions to simultaneously estimate each student's overall understanding of the material (a.k.a. the latent trait or person parameter, θ) and determine the probability that a student will be correct on each question given their understanding [27,28]. The latent trait is normalized such that the average value is θ = 0 and the standard deviation is σ_θ = 1. In the two-parameter logistic (2PL) IRT model, the probability of a student answering a specific question correctly is given by

P(θ) = 1 / (1 + e^(−a(θ−b))),

where a is the discrimination parameter and b is the difficulty parameter. Some previous work has used the three-parameter logistic model to analyze RBAI data [29], but we feel that the inclusion of the third "guessing" parameter is inappropriate for our analyses given that student responses to the FMCE are concentrated in a small subset of responses for each question: they are not, in fact, guessing [7,11].
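The 2PL curve can be evaluated numerically; below is an illustrative Python sketch (the paper's fits used the mirt package in R), with made-up parameter values. It also checks the two interpretive facts used below: P(b) = 0.5 and a slope of a/4 at θ = b.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response: P(theta) = 1/(1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (np.asarray(theta) - b)))

a, b = 1.2, 0.3                      # illustrative values, not fitted ones
p_at_b = p_correct_2pl(b, a, b)      # probability at theta = b is exactly 0.5

# Numerical slope at theta = b; analytically dP/dtheta = a*P*(1-P), so a/4 here.
h = 1e-6
slope = (p_correct_2pl(b + h, a, b) - p_correct_2pl(b - h, a, b)) / (2 * h)
```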
The interpretation of the parameters in the 2PL model may be understood by examining plots of P(θ) vs. θ: Fig. 1 shows examples from several questions. The difficulty b is the value of θ at which P(b) = 0.5, and the discrimination a is proportional to the slope of the curve at θ = b: dP/dθ|θ=b = a/4. Questions 1 and 14 (Fig. 1(a) and (b), respectively) have similar difficulty parameters (the b values differ by less than 0.1 standard deviations of the latent ability θ), but Q14's higher discrimination parameter a shows up as a sharper transition from most likely incorrect to most likely correct, and a steeper slope at the midpoint of the curve. Question 22 in Fig. 1(c) has a similar discrimination to Q1 (similar slope at P(θ) = 0.5), but the difficulty is much lower (shown by a shift to the left compared to Fig. 1(a)), with many below-average students (θ < 0) being fairly likely to answer correctly.
Question 47 in Fig. 1(d) has a difficulty parameter that is about average (close to zero), but the discrimination is relatively small, as shown by a shallow slope and a more gradual transition from most likely incorrect to most likely correct.

Bock's nominal response model (NRM) instead estimates the probability of a student choosing each individual response,

P_k(θ) = e^(a_k θ + b_k) / Σ_j e^(a_j θ + b_j),

where k indicates the particular response choice, and the summation is performed over all N response choices. According to Bock and Moustaki, the value of the a_k parameter may be used to rank the incorrect responses, with a higher value indicating a response that is more closely correlated with the latent trait and, therefore, better than a response with a lower value [30]; however, the meaning of the a_k and b_k parameters is not as easily interpreted as the a and b of the 2PL model [20]. One shortcoming of the NRM is that the parameters are not uniquely defined, and a normalization constraint is required. This is often accomplished by fixing the values of both parameters associated with one particular response (at 0 or 1) and determining all other parameters relative to those. The NRM is excellent for analyzing data for which no prior information is available regarding the relative correctness of any of the responses; however, we have found that for our FMCE data the parameters occasionally become reversed, with choosing the correct response being associated with having a low value of θ (i.e., a poor overall understanding of Newtonian mechanics).
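Bock's NRM is a multinomial-logit (softmax) form over the response choices. As an illustrative sketch in Python (the study used the mirt package in R), with invented parameter values:

```python
import numpy as np

def nrm_probs(theta, a_k, b_k):
    """Bock's NRM: P_k(theta) = exp(a_k*theta + b_k) / sum_j exp(a_j*theta + b_j)."""
    z = np.asarray(a_k) * theta + np.asarray(b_k)
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

# Three hypothetical responses; the first has a_0 = 0 as a normalization
# constraint, and the response with the largest a_k dominates at high theta.
a_k = [0.0, 0.6, 1.5]
b_k = [0.0, 0.2, -0.5]
probs = nrm_probs(2.0, a_k, b_k)
```

The probabilities always sum to one, and the response whose a_k is largest is the one most strongly associated with high θ, which is what motivates using a_k to rank responses.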

IV. RANKING INCORRECT RESPONSES
In order to rank incorrect responses while properly accounting for the correct response, we use the 2PL-NRM nested logit model developed by Suh and Bolt [21]. In this model, the probability of a student choosing a specific incorrect response k is given by

P_k(θ) = [1 − P(θ)] × e^(a_k θ + b_k) / Σ_j e^(a_j θ + b_j),

where the first term in brackets is the probability of being not correct from the 2PL model, and the second term is Bock's NRM with the summation being over only the incorrect responses.
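The nesting can be sketched as follows, with hypothetical parameters: the 2PL gives the probability of being correct, and the remaining probability mass is split among the incorrect responses by the NRM, so all response probabilities sum to one by construction.

```python
import numpy as np

def nested_logit_probs(theta, a, b, a_k, b_k):
    """2PL-NRM: probability of each incorrect response k is
    [1 - P_correct(theta)] times Bock's NRM over the incorrect responses only."""
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    z = np.asarray(a_k) * theta + np.asarray(b_k)
    e = np.exp(z - z.max())          # stable softmax over incorrect responses
    return p_correct, (1.0 - p_correct) * e / e.sum()

# Hypothetical question: one correct response and four incorrect ones.
p_corr, p_incorrect = nested_logit_probs(0.5, a=1.0, b=0.0,
                                         a_k=[0.0, 0.4, -0.3, 0.9],
                                         b_k=[0.0, 0.1, 0.2, -0.4])
```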
In this model, the values of all θ, a, and b parameters are calculated using the 2PL model, and all a_k and b_k parameters are determined using the NRM, given the 2PL results. We used the Multidimensional Item Response Theory (mirt) package in the R programming language to perform all IRT analyses [22,23,31].
To determine the ranking of incorrect responses, we calculated the values of a and b for every question, and a_k and b_k for each incorrect response choice. According to de Ayala, a data set must have at least 10 times as many response sets as the number of parameters to be calculated for an IRT model to have good convergence [28]; with our data set of N = 12,388 response sets we are more than able to determine the 722 necessary parameters.
We choose to omit response J from our ranking of incorrect responses because it does not correspond to a unique choice that students make about what they think is correct. Two students who choose response A (for example) agree that the information associated with A is correct, but two students who choose response J may or may not agree with each other regarding what a correct response would be. As such, we do not think it is valid to suggest that choosing J represents a unique level of correctness.
As a result of using the mirt package to apply the 2PL-NRM nested logit model, every response choice within each question has a unique a_k value, implying that all answers are meaningfully different from each other. The question is then whether or not any of the a_k values may be considered approximately equal to others, indicating approximately equal correlations with the θ parameter (i.e., response choices that are equally correct). To determine whether or not response choices are different from each other, we calculated the sampling distribution of values for each a_k parameter by selecting a random sample of 7,300 respondents using the sample function in R, and we used the mirt package to calculate each parameter [32]. We repeated this process over 100,000 times to create a set of values for each parameter.
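The resampling procedure can be sketched as below. This is a toy illustration: `fit_parameters` is a stand-in for the actual IRT fit (the authors refit the 2PL-NRM with mirt in R on each subsample), and here the "response sets" are plain numbers whose "fit" is just their mean.

```python
import random

def sampling_distribution(response_sets, fit_parameters,
                          sample_size=7300, n_reps=1000):
    """Repeatedly draw a random subset of respondents (without replacement,
    like R's `sample` default) and refit to build a parameter distribution."""
    values = []
    for _ in range(n_reps):
        subset = random.sample(response_sets, sample_size)
        values.append(fit_parameters(subset))
    return values

# Toy stand-in data and "fit": the sampling distribution of a mean.
random.seed(0)
data = list(range(10000))
dist = sampling_distribution(data, lambda s: sum(s) / len(s),
                             sample_size=5000, n_reps=200)
```

The spread of the resulting values is what the Hedges' g comparisons below operate on: narrow, well-separated distributions give conclusive rankings, while broad, overlapping ones do not.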
The mirt package uniquely determines each value of a_k by setting one parameter equal to 1 for each question [23]. In order to ensure that we obtained a distribution of values for all parameters of interest, we chose to include one set of responses that included a "dummy" response and set a_0 = 0 for this response. All other a_k parameters are determined relative to a_0; as such, the a_k values are only meaningful when compared within the same question, and there are no thresholds for determining whether a particular value of a_k is high or low in and of itself.
Using the effsize package, we calculated a Hedges' g effect size to quantify the magnitude of the difference between the a_k values for each pair [33]. In our full data set, every response to every question is selected in at least one response set. The a_k values reported in Tables I and II are based on these distributions. There are several key features to notice about these distributions:
• there is no distribution for the correct response because the 2PL-NRM does not calculate an a_k parameter for the correct response; it is automatically assumed to be the best,
• the distribution for a_A is higher than any other value of a_k, with only minimal overlap with a_E, indicating that A has the highest parameter and is thus the best incorrect response, and
• the distributions of a_C and a_G are practically identical, indicating that these parameters have very similar values; thus, we would interpret them as being equally correct.
Other comparisons between various responses are a bit more ambiguous. The a_D and a_F distributions look quite similar, but not as similar as a_C and a_G. The a_H distribution is noticeably shifted to the left of a_C and a_G, but there is still quite a bit of overlap. We use Hedges' g to quantify the magnitude of the difference between each pair of distributions of the a_k values: if g is small (g < 0.5), we conclude that the parameter values are effectively equal and the responses are equally correct; if g is very large (g ≥ 1.3), we conclude that the parameters are significantly different and that the responses represent different levels of understanding; and if 0.5 ≤ g < 1.3, then we cannot make a conclusive determination. Tables I and II show the IRT ranking results for each question on the FMCE, including the a_k value for each response, the value of Hedges' g used to determine the ranking between each pair of parameters, and the percent of the data set that chose each response (rounded to the nearest percent). As an example, consider a question for which the correct response is a graph with a constant negative value (zero slope). According to these results, the best incorrect answer is A: a graph with a constant positive value (zero slope). The second-best incorrect response is E: a graph with a constant zero value (zero slope). All other responses are graphs with nonzero slope. This suggests that realizing that a constant acceleration indicates a constant force is indicative of an above-average understanding of basic Newtonian mechanics. This result alone may not be revolutionary to anyone who has taught introductory mechanics, but the implication that claiming that zero force is required to make an object speed up is a better answer than selecting a graph showing a changing force may be more surprising.
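The effect-size computation and threshold scheme can be sketched in Python (the paper used the effsize package in R); the sample values below are invented for illustration.

```python
import math

def hedges_g(x, y):
    """Hedges' g: pooled-SD standardized mean difference with the usual
    small-sample correction factor 1 - 3/(4(nx + ny) - 9)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    s = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / s * (1 - 3 / (4 * (nx + ny) - 9))

def ranking_symbol(g):
    """Map |g| onto the paper's thresholds: equal (< 0.5), clearly different
    (>= 1.3), or inconclusive in between."""
    g = abs(g)
    if g < 0.5:
        return "="     # effectively equal parameter values
    if g >= 1.3:
        return ">"     # responses represent different levels of understanding
    return ">="        # no conclusive determination
```

Applied to two sampling distributions of a_k values, `ranking_symbol(hedges_g(x, y))` reproduces the three-way classification used in Tables I and II.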
Response E is chosen by fewer than 1% of the data set, but these students seem to otherwise have a fairly strong understanding of Newton's laws as measured by the FMCE. Table III shows IRT rankings that have been filtered to only include responses given by at least 1.00% of the population. For some questions (such as 1, 16, and 22) the difference is quite stark, with only the correct and one incorrect answer choice remaining. For many of the questions (such as 2, 14, and 26) the rankings remain the same, but many of the responses that are seen as equivalent to others (or ambiguously ranked) have been eliminated.
Moreover, a smaller fraction of the rankings in Table III are "≥" as compared to Tables I and II (29% vs. 35%), and a greater fraction of rankings are ">" (56% vs. 36%). This suggests that many of the ambiguities in rankings may be attributed to the relatively low probability of choosing those responses at all levels of physics understanding, which could result in relatively broad distributions of parameter values generated by randomly selecting subsets of data.

V. RELATING RESPONSE RANKINGS TO IRT PLOTS
We can use plots of the IRT curves to better understand these rankings. Figure 3 shows the 2PL-NRM curves for every response to several questions. (The 2PL-NRM IRT plots for all FMCE questions are included in Figs. 6-8 below.) In Fig. 3(a), response H is most likely to be chosen by students with above-average understanding, while response C is mostly chosen by students with below-average understanding and is more and more likely with lower values of θ. This is consistent with the ranking in Table III, with H being a better response than the most common A, and C being a worse response. We can also see in Fig. 3(c) that responses B and C on Q14 have a similar shape, with the highest probability of choosing each being at the low end of the θ-axis; this is consistent with these responses being considered equivalent in Table III. Question 19 in Fig. 3(e) is very similar to Q14 in that there is a single most-common-incorrect response (D), a better response (A) that has a relatively narrow symmetric probability distribution centered around θ ≈ b > 0, and some worse responses (C, G, H) that have their highest probabilities at the low end of the θ-axis. The major difference between the plots for Q19 and Q14 is that the most-common-incorrect response has a much broader range of values in the θ < 0 regime for Q19. In fact, response D is approximately equally probable to responses C, G, and H around θ ≈ −3. Moreover, the probability curves for the equivalent responses C and G are basically on top of each other for the entire range of θ. This is strong evidence that students choose these responses in roughly the same proportions, and it may indicate that students are choosing them for the same reasons. Response H also has similar probabilities to C and G, but has a distinctly more negative slope than either, indicating a greater likelihood of being chosen by students with lower θ values.
The plot for question 18 in Fig. 3(d) is even more complex than that of Q14, but there are several similarities between them. Once again there is a single dominant incorrect response (H), but it is less dominant than the most common response to Q1 or Q14. This is largely due to G and D being equivalent to H (according to Tables I and III), and all three sharing a similar shape in which the probability is relatively uniform for θ < 0 (but has a slight peak around −1 < θ < −0.5) and decreases to zero for 0 < θ < 2. Responses D and G each have probabilities between about 0.1 and 0.15 for θ < 0, which accounts for the probability of the most common response H never rising above 0.75. Also in Fig. 3(d) we can see a small bump in probability for the better-than-common response A around θ ≈ 0.75, and the worst responses (C and F) are most common at the lowest values of θ. Questions 14, 18, and 19 all indicate that students with above-average understanding (θ > 0) are more likely to choose both the correct response and the highest-ranked incorrect response than students with θ < 0.
The 2PL-NRM plot for question 47 in Fig. 3(f) shows an example of a question on the FMCE for which there is not a single dominant incorrect response. Once again we can see the ranking from Table III in the shape of the curves: A is correct, D and B have similar shapes with maximum probabilities around θ ≈ −1.5, and C has its highest probability at the low end of the θ-axis with a distinctly negative slope. What makes Q47 really interesting is that none of the responses is uniformly zero over a broad range of θ values. All incorrect responses have probabilities above 0.15 for −3 < θ < −1, and the probabilities for two of the incorrect responses (B and C) are roughly equal to the correct response around θ ≈ −0.5 (with response D only about 0.10 lower). The relatively high probabilities of all incorrect answer choices on Q47 may contribute to the low value of the discrimination parameter a (see Fig. 1).
Question 8 (Fig. 3(b)) shows an interesting example of a case where there are multiple responses (B, C, D, and E) ranked higher than the most common incorrect response G. Only response F is ranked lower than G (with g ≈ 16). On the FMCE, question 8 is the first in a set of three that asks students about the forces on a toy car as it rolls up and down a ramp.
In question 8, the car is moving up and slowing down: response A is correct that the net force on the car is constant and down the ramp, response G is that the force is up the ramp and decreasing, while response F is that the force is up the ramp and increasing. Responses B and C both indicate a force down the ramp (increasing and decreasing, respectively), response D indicates zero net force, and response E is that the force is up the ramp and constant. All of these better-than-most-common responses agree with the correct answer in one way (either the direction of the force or the fact that it is constant), while G and F have nothing in common with the correct response.
From a visual perspective, the approximate value of θ for which the probability curve for a particular response is maximized indicates the ranking for that response [37]. The curve for the correct answer is always a monotonically increasing function of θ; answers that are better than a most-common incorrect answer (like H on Q14 in Fig. 3(c) and A on Q18 in Fig. 3(d)) have relatively narrow probability distributions with peaks around θ = b; the worst response in each rank has its highest value at the low end of the displayed θ-axis with a distinctly negative slope, indicating that lower values of θ would yield even higher probabilities of choosing those responses. Plots of IRT probability curves for all FMCE questions are included in Figs. 6-8 below.
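This visual heuristic (rank each response by where its probability curve peaks) can be sketched numerically. The two curves below are made-up shapes, not fitted FMCE curves: a narrow bump for a "better" response and a curve that keeps rising toward low θ for a "worst" response.

```python
import numpy as np

def peak_theta(probs, theta_grid):
    """Return the theta at which a response's probability curve is maximized;
    a peak farther to the right suggests a better-ranked response."""
    return float(theta_grid[int(np.argmax(probs))])

theta = np.linspace(-4, 4, 801)
better = np.exp(-(theta - 1.0) ** 2)       # narrow bump centered near theta = 1
worst = 1.0 / (1.0 + np.exp(2.0 * theta))  # keeps rising toward low theta
```

A peak pinned to the left edge of the grid (like `worst` here) signals that even lower θ values would be more likely still, matching the "distinctly negative slope" reading of the worst responses.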

VI. ANOMALOUS RESULTS
We used Rowan University's high-performance computing cluster to generate the distribution of values for each parameter. In examining the results, we noticed some anomalous values for each parameter as shown by the small bumps on the right side of Fig. 4. These bumps represent about 1% of the results. Figure 5 shows an enlarged version of the smaller distribution [38].

VII. SUMMARY AND FUTURE DIRECTIONS
The 2PL-NRM nested logit IRT model may be used to rank incorrect responses to FMCE questions by considering the calculated a k value as a measure of the correlation between choosing a particular response and the value of the θ parameter representing overall understanding of Newtonian mechanics (as measured by the FMCE). We have shown that using random samples of a large data set can generate distributions of values for each a k that allow us to determine whether or not these parameters are meaningfully different, and we used Hedges' g as a statistical measure to quantify these differences. We made particular choices regarding the values of g that we consider to represent parameters that are approximately equal, those that are definitely different, and those that could go either way, and we have reported the value of g for all comparisons to allow the reader to evaluate the validity of our claims or determine a ranking based on other choices of thresholds.
In many cases the responses that could not be determined to be definitely different from or approximately the same as others are those that are rarely chosen by students. Future research will focus on clarifying these comparisons and trying to determine a robust ranking for all responses to every question. One way to accomplish this will be to perform similar analyses on other large data sets. Online data collection and analysis tools such as PhysPort's Data Explorer [26] and the Learning About STEM Student Outcomes (LASSO) platform [39] make this task much more achievable than it would have been even a decade ago.
The results presented here are based on the assumption that students who understand more about physics will answer more questions correctly on the FMCE, and will also select better incorrect responses than students who understand less overall physics. We also ignored whether data were collected before or after instruction: we didn't care how students obtained their understanding, just that they had some level of understanding when they chose their responses. In future work, we will explore the implications of different assumptions for what makes one response better than another.
One such assumption is that students will choose better responses after instruction than before instruction. This assumption is supported by the fact that (on average) students are more likely to choose correct responses after instruction than before: even low class-averaged gains from traditional instruction tend to be positive [2,7,40]. A method consistent with this assumption would be to look for asymmetric transitions between response choices in matched pre/posttest data using a McNemar-Bowker chi-square test [41,42].
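The Bowker symmetry statistic on a square pre/post transition table can be sketched as follows; the table values are invented. (A full test would compare the statistic to a chi-square distribution with `df` degrees of freedom; only the statistic is computed here.)

```python
import numpy as np

def bowker_statistic(table):
    """McNemar-Bowker test of symmetry: chi2 = sum over pairs i < j of
    (n_ij - n_ji)^2 / (n_ij + n_ji), with df counting pairs where
    n_ij + n_ji > 0. A large statistic indicates asymmetric transitions."""
    t = np.asarray(table, dtype=float)
    stat, df = 0.0, 0
    for i in range(t.shape[0]):
        for j in range(i + 1, t.shape[1]):
            total = t[i, j] + t[j, i]
            if total > 0:
                stat += (t[i, j] - t[j, i]) ** 2 / total
                df += 1
    return stat, df

# Hypothetical 3x3 table: rows = pretest response, columns = posttest response.
stat, df = bowker_statistic([[20, 15, 5],
                             [ 3, 30, 8],
                             [ 2,  6, 25]])
```

A perfectly symmetric table gives a statistic of zero; asymmetric off-diagonal counts (students moving from one response to another more often than the reverse) drive the statistic up.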
Another assumption that could be made is that students are more likely to choose correct responses after instruction if they chose better incorrect responses before instruction. Using conditional probabilities would allow us to identify a progression of responses to each question, with students moving up the progression being considered getting closer to correct. This is consistent with Thornton's conceptual dynamics in which students move between various views as they progress toward the correct response [10].
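The conditional-probability idea can be sketched as follows; the matched pre/post pairs below are fabricated for illustration, and a real analysis would use the full matched FMCE data.

```python
from collections import Counter

def p_correct_post_given_pre(pairs, correct):
    """P(correct on posttest | pretest response), estimated from matched
    (pretest, posttest) response pairs for a single question. Pretest
    responses with higher values would sit higher in the progression."""
    totals = Counter(pre for pre, _ in pairs)
    hits = Counter(pre for pre, post in pairs if post == correct)
    return {pre: hits[pre] / totals[pre] for pre in totals}

# Invented matched pairs for one question whose correct answer is "B".
pairs = [("A", "B"), ("A", "B"), ("A", "C"), ("D", "B"), ("D", "D")]
progression = p_correct_post_given_pre(pairs, correct="B")
```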
Each of these methods could be used to test the rankings presented above and help clarify the ambiguously ranked responses. These methods and assumptions may also be applied to other research-based assessments to determine a robust ranking of incorrect responses for any multiple-choice question.

[Caption for Figs. 6-8: The curves show the probability of choosing each response, given the value of a student's θ parameter, corresponding to the rankings in Tables I and II. Each curve is labeled near its maximum value (with slight adjustments to avoid overlapping labels and curves), so the horizontal location of the label provides an approximate ranking of the responses (right is better, left is worse). Ambiguously ranked responses often show up as lines near zero probability for all values of θ.]
[1] Adrian Madsen, Sarah B. McKagan, and Eleanor C. Sayre, "Resource Letter RBAI-1: Research-based assessment instruments in physics and astronomy," Am. J. Phys. 85, 245 (2017).
[4] Ronald K. Thornton and David R. Sokoloff, "Assessing student learning of Newton's laws: The Force and Motion Conceptual Evaluation and the evaluation of active learning laboratory and lecture curricula," Am. J. Phys. 66, 338-352 (1998).