How to quantify the efficiency of a pedagogical intervention with a single question

In many situations, the change in students' conceptual understanding is measured using a single question. This is, for instance, the case in peer instruction, where students answer the same question twice, before and after the discussion phase. Using item response theory and assuming that students' proficiencies are normally distributed, it is shown that the Cohen's d effect size characterizing the change of mean proficiencies can be estimated by taking 0.6 times the log of the odds ratio of class scores. Moreover, the polychoric correlation coefficient between students' answers is suggested as an additional indicator to detect abnormal changes in scores when its value is below 0.3. Taken together, these two indicators give both a precise measurement of a pedagogical intervention (a peer discussion or something else) and a safety indicator to detect random answers or poorly written questions. The method is applied to the evaluation of peer discussions that took place in an introductory mechanics course taught using peer instruction.


I. INTRODUCTION
Concept inventories are widely used in education research to evaluate changes in conceptual understanding related to a specific intervention [1]. In this case they are administered twice: once before the intervention (generally a full semester course) and once after it (at the end of the semester). For instance, the Force Concept Inventory (FCI) [2] evaluates students' mastery of Newton's laws [3]. It is composed of 30 multiple-choice questions whose incorrect answers are based on the most frequent answers given by students during interviews. Concept inventories are difficult to design [1,4], and their administration takes time and requires some caution [4].
Instructors or researchers can also measure the change in conceptual understanding of students due to a pedagogical intervention using only one question. A typical example is the use of Peer Instruction (PI), an evidence-based, interactive teaching method widely used by science teachers [5,6]. A class taught with PI is divided into a series of short presentations, each focused on a central point and followed by a related conceptual question, called a ConcepTest, which probes students' understanding of the ideas just presented. Students are first given one or two minutes to formulate individual answers and report them to the instructor using classroom response systems such as clickers. Students then discuss their answers with others sitting around them. A few minutes afterwards, the instructor calls for an end to the discussion and polls students for their answers again, which may have changed based on the discussion. Finally, the instructor explains the answer and moves on to the next topic. In this case, the change of conceptual understanding is measured by the two votes, and the pedagogical intervention is the discussion. Many kinds of interventions can be measured in this way. For instance, instead of a discussion with peers, the instructor can give a hint, or give additional time to think about the question [7]. Variations of the instructor's directions for the discussion can also be tested, such as telling students to reach a consensus with their peers [7]. In online learning environments, an intervention can be the use of an online discussion board to discuss the question with peers, or the display to each student of only a few selected rationales written by previous students [8,9].
The efficiency of an intervention can be analyzed qualitatively, for instance, by listening to the students' dialogues [10,11]. While this technique is powerful, it is extremely time consuming. Hence it is mostly restricted to research applications. In this article, the efficiency of an intervention is evaluated through the variation of students' answers between the first vote (before the intervention) and the second vote. The analysis leads to two quantitative outputs: the Cohen's d effect size of the learning and a correlation index enabling us to detect a global guessing behavior or poor item formulations. A main advantage of using students' answers is that the indicators can be computed automatically, without any human intervention. Our two indicators are based on latent trait modeling, i.e., item response theory, in order to give the most reliable information.
Potential applications of the proposed method include the following:
• Teachers who have to create many ConcepTests for their courses and then select, year after year, which questions to keep and which ones to move or discard.
• Researchers who want to test variations of the instructor's directions in peer instruction, quantifying how many times a particular method is better than the others.
• Software designers and online-learning teachers wanting to select the best among various strategies.
The article is organized as follows: Sections II and III deal with traditional measures based on the variation of students' scores and highlight their limitations; Secs. IV and V introduce item response theory and assumptions on students' conceptual understanding; and, finally, Secs. VI, VII, and VIII introduce how to calculate new indicators, based on item response theory, of the progression of students in the conceptual understanding related to the question. Section IX presents an application to items used in a course taught using PI.

II. USUAL MEASURES OF DIFFERENCES BETWEEN PROPORTIONS
Pre- and postintervention scores are the proportions of students who answer the question correctly. Let $p_1$ denote the proportion of students who answer correctly at the first vote, and $p_2$ the proportion of students who answer correctly at the second vote, i.e., after the pedagogical intervention. Usual measures to compare the effect of an intervention are the risk difference, the risk ratio, and the odds ratio [12]. All those measures are based on a probabilistic point of view and are often used in epidemiology, where the aim is to reduce the risk of suffering from a particular disease.
The risk difference is simply the difference in risk (probability) of an event between two groups. In this case, the risk difference is $\mathrm{RD} = p_2 - p_1$.
The risk ratio, also called the relative risk, is the ratio of two risks. In this case it is given by $\mathrm{RR} = p_2/p_1$. For instance, if 40% of students answer correctly at the first vote and 70% at the second one, the risk ratio is 1.75. It means that students are 1.75 times more likely to give a correct answer after the pedagogical intervention than before it.
Where the risk ratio is the ratio of two risks, the odds ratio is the ratio of two odds. Here, the odds of success after discussion is $p_2/(1 - p_2)$, while the odds of success before discussion is $p_1/(1 - p_1)$. Using the same example as previously, before the intervention, students are $0.67 = 0.4/(1 - 0.4)$ times as likely to give a correct answer as to give an incorrect one. After the intervention, they are $2.3 = 0.7/(1 - 0.7)$ times as likely to give a correct answer as to give an incorrect one. The ratio of the two odds is defined by $\mathrm{OR} = [p_2/(1 - p_2)] \times [(1 - p_1)/p_1]$. In our example, it is equal to $2.3/0.67 = 3.5$.
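The three measures defined above can be computed directly from the two class scores. The following minimal Python sketch reproduces the worked example ($p_1 = 40\%$, $p_2 = 70\%$); the function names are illustrative choices and do not come from the original work.

```python
def odds(p):
    """Odds of success for a proportion p strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)


def risk_difference(p1, p2):
    """RD = p2 - p1."""
    return p2 - p1


def risk_ratio(p1, p2):
    """RR = p2 / p1."""
    return p2 / p1


def odds_ratio(p1, p2):
    """OR = [p2 / (1 - p2)] / [p1 / (1 - p1)]."""
    return odds(p2) / odds(p1)


if __name__ == "__main__":
    p1, p2 = 0.40, 0.70  # scores at the first and second vote
    print(f"RD = {risk_difference(p1, p2):.2f}")  # RD = 0.30
    print(f"RR = {risk_ratio(p1, p2):.2f}")       # RR = 1.75
    print(f"OR = {odds_ratio(p1, p2):.2f}")       # OR = 3.50
```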

III. WHAT'S WRONG WITH THOSE MEASURES?
The risk difference is not an interval scale [13]: the significance of a particular value of RD depends on the initial prescore. For instance, an RD of 10% does not have the same significance when the initial score is 10%, 50%, or 90%. Hence the RD cannot be a correct indicator to compare the efficiency of an intervention independently of the initial score.
The risk ratio suffers from the same problem. An RR of 1.5 does not have the same meaning when the initial score is 10%, 50%, or 90%; it is even impossible to get an RR of 1.5 if the initial score is 90%.
The odds ratio is also not an interval scale. However, as will be shown in the next sections, its logarithm is, under some assumptions that will be detailed later, an interval scale.
RD, RR, and OR can be used to rank two interventions starting with the same initial prescore. For instance, if interventions A and B both start with an initial score $p_1 = 40\%$ but intervention A leads to an increase of 20 percentage points while intervention B leads to an increase of 10 percentage points, it can be concluded that intervention A is better than intervention B. However, these measures cannot be used
• to compare two interventions starting with two different initial scores, or
• to quantify how many times an intervention is greater than another, even if they both start with the same initial score. For instance, while intervention A leads to an increase of 20 percentage points and intervention B leads to an increase of 10 percentage points, it cannot be concluded that intervention A is twice as good as intervention B.
In the following sections, item response theory will be used to build a quantitative indicator that overcomes these limitations.

IV. ITEM RESPONSE THEORY
Item response theory (IRT) belongs to the family of latent trait models [14]. In those models, each student is described by a number of latent traits, also called proficiencies. The answer of a student to a question is thought of as the result of the interaction between the capabilities of the person taking the test and the characteristics of the test items. The score of a student on an item is modeled by a probabilistic function of the student's proficiencies and some characteristics of the item. A considerable amount of knowledge and skills is always necessary to give a correct answer [15], but in many cases only one proficiency is sufficient to determine the student's score. This is called unidimensional item response theory, often simply called IRT.
In unidimensional IRT, students are described by a single continuous unbounded variable, called the proficiency. One claim of IRT is that this proficiency is an interval scale. This means that an equal difference in proficiency always has the same significance, independently of the initial proficiency. Hence this scale can be used to measure the efficiency of a pedagogical intervention, whether the initial score on the item is 10%, 50%, 90%, or any other value. It can also be used to calculate how much better one intervention is than another.
The aim of a question is to test students' understanding of a particular concept. Let $\theta$ denote the proficiency of a student that is measured by the question. A greater value of $\theta$ means a greater understanding of the associated concept. While the understanding of a concept is sometimes thought of as a binary variable, IRT assumes that it is a continuous scale. The application and validity of IRT for physics questions have been illustrated by analyzing score patterns of students on the Force Concept Inventory [16-19], the Mechanics Baseline Test [20], the Force and Motion Conceptual Evaluation [21], the Brief Electricity and Magnetism Assessment [22,23], and the Continuous Time Signals and Systems Concept Inventory [24].
In IRT, an item is modeled by a function $P(\theta)$, which describes the probability that a student with proficiency $\theta$ gives the correct answer to the item. The function $P$, called the item characteristic curve, is often assumed to be a generic S-shaped function, called a logistic function, whose form characterizes each question. In this work, $P$ is assumed to follow the two-parameter logistic item model, called the 2PL model:
$$P(\theta) = \frac{1}{1 + e^{-1.7\,a(\theta - b)}}, \tag{1}$$
where $a$ and $b$ are parameters of the item: $a$ is its discrimination power and $b$ its difficulty. The constant 1.7 scales the logistic curve so that it closely matches a normal ogive of standard deviation $1/a$ [25]. Usually these parameters are estimated using statistical techniques on a large pool of students' answers to many items. For instance, Wang et al. [17] use response patterns of 2800 students on the 30 FCI items. In this case the proficiency is what is commonly measured by all the items [15]. Apart from the two-parameter model chosen here, standard models used to describe the function $P$ are the Rasch model (also called the 1PL model) and the 3PL model. Descriptions of these models and our reasons for selecting the 2PL model are discussed in Appendix A.
In the framework of latent trait modeling, an item is seen as an imperfect measuring tool of proficiency. The proficiency is a latent variable because it is not directly observed. What is observed is the answers of the students to the item. Let $X$ denote a particular answer, which is true if the student answers correctly and false if the student answers incorrectly. Hence $X$ is a dichotomous categorical variable. The latent model is written as [14]
$$X = \begin{cases} \text{true} & \text{if } \theta + \epsilon > b, \\ \text{false} & \text{otherwise,} \end{cases} \tag{2}$$
where $\epsilon$ is the measurement error. This measurement error has a null mean and a standard deviation equal to $1/a$. The item difficulty is seen as a threshold value, and the discrimination coefficient represents the quality of the measure. In the case of a null measurement error, the logistic item characteristic curve Eq. (1) reduces to a Heaviside step function centered at $\theta = b$. In this case the student gives a correct answer if and only if their proficiency is greater than the difficulty of the item.

V. DISTRIBUTION OF STUDENTS' PROFICIENCIES
The proportion of correct answers to a question depends on the distribution of the students' proficiencies. In latent trait modeling and IRT, it is often assumed that the latter is normally distributed. This assumption will be used in the following, i.e., $\theta \sim \mathcal{N}(\bar{\theta}, \sigma_\theta^2)$. The average proficiency $\bar{\theta}$ and the standard deviation $\sigma_\theta$ are unknown. However, using a logistic approximation to the cumulative normal distribution [25], they can be related to the proportion of correct answers $p$ (see Appendix B):
$$\frac{\bar{\theta} - b}{\sigma_Y} = 0.6 \ln\frac{p}{1 - p}, \tag{3}$$
where $\sigma_Y$ is the standard deviation of $Y = \theta + \epsilon$. The ratio between $\sigma_Y$ and $\sigma_\theta$ depends on the measurement error of the item. Using data from Lasry et al. [26], this ratio was estimated to be 1.09 (see Appendix C). Hence this ratio will be assumed to be equal to 1, i.e., measurement errors have a negligible effect. As a consequence, the parameters of the distribution of the students' proficiencies are linked to the proportion of correct answers by
$$\frac{\bar{\theta} - b}{\sigma_\theta} = 0.6 \ln\frac{p}{1 - p}. \tag{4}$$
The right-hand side is 0.6 times the logit of $p$, defined by $\mathrm{logit}(p) = \ln[p/(1 - p)]$. The logit function transforms a probability between 0 and 1 into a value between $-\infty$ and $+\infty$ (see Fig. 1).
FIG. 1. The logit function. The vertical axis is $0.6 \ln[p/(1 - p)]$, where $p$ is given by the horizontal axis.

VI. EFFECT SIZE
Let us go back to our initial problem. Before the pedagogical intervention, the proficiencies of students are denoted $\theta_1$ and assumed to be normally distributed: $\theta_1 \sim \mathcal{N}(\bar{\theta}_1, \sigma_1^2)$. After the intervention, their proficiencies are $\theta_2$ and are also assumed to be normally distributed: $\theta_2 \sim \mathcal{N}(\bar{\theta}_2, \sigma_2^2)$. Applying Eq. (4) to pre- and postscores leads to
$$\frac{\bar{\theta}_1 - b}{\sigma_1} = 0.6\,\mathrm{logit}(p_1), \qquad \frac{\bar{\theta}_2 - b}{\sigma_2} = 0.6\,\mathrm{logit}(p_2). \tag{5}$$
The ratio $\sigma_2/\sigma_1$ is unknown. However, estimations show that it can be assumed to be close to 1 (see Appendix D). Hence, assuming $\sigma_2 \simeq \sigma_1 = \sigma$, Eq. (5) gives the Cohen's d effect size [27]:
$$d = \frac{\bar{\theta}_2 - \bar{\theta}_1}{\sigma} = 0.6 \ln\left[\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)}\right] = 0.6 \ln(\mathrm{OR}). \tag{6}$$
Equation (6) estimates the difference of means of two normal distributions using the areas under the curves that are beyond a threshold value, as illustrated in Fig. 2. The horizontal axis represents students' proficiency. The Gaussian curve gives the proportion of students with a given proficiency. The vertical dotted line corresponds to the difficulty of the item. Students with a proficiency higher than the item's difficulty answer the question correctly. The proportion of correct answers is therefore represented by the gray area on the right of the vertical dotted line. In the second vote, the distribution of students' proficiency has shifted to the right, increasing the proportion of students with a proficiency higher than the item's difficulty. Equation (6) therefore makes it possible to evaluate the offset between the two distributions from the areas under the respective curves.
The Cohen's d effect size is perhaps the most commonly used effect size metric and is broadly used in education research and many other fields. Estimating the change of scores in terms of the Cohen's d effect size allows us to use the rules of thumb for interpreting the efficiency of the intervention on a scale from very small to huge [27,28]. Figure 3 plots isovalues of the effect size corresponding to these degrees in the $(p_1, p_2)$ plane, enabling a quick estimate of the effect size of an intervention. Moreover, the effect size enables comparisons with other teaching methods [29] and should be used rather than other measures such as the Hake's g [30].
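As a sketch of how this estimate can be computed in practice, the Python functions below implement the relation stated in the abstract, namely that the effect size is 0.6 times the log of the odds ratio of the class scores. The function names and the `scale` parameter are illustrative assumptions, not part of the original work.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p); p must lie strictly in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def effect_size(p1, p2, scale=0.6):
    """Cohen's d estimated as scale * ln(odds ratio) of the two class scores."""
    return scale * (logit(p2) - logit(p1))


if __name__ == "__main__":
    # Worked example used throughout the text: 40% then 70% correct answers.
    print(f"d = {effect_size(0.40, 0.70):.2f}")  # d = 0.75
    # The same 10-point gain corresponds to different effect sizes
    # depending on the initial score.
    for p1 in (0.10, 0.50, 0.85):
        print(f"p1 = {p1:.2f}, p2 = {p1 + 0.10:.2f}, "
              f"d = {effect_size(p1, p1 + 0.10):.2f}")
```

The second loop illustrates why the risk difference alone is not an interval scale: an identical gain in score maps to a larger shift in mean proficiency when the initial score is far from 50%.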

VII. TETRACHORIC CORRELATION COEFFICIENT
The correlation coefficient between the students' answers before and after the intervention can shed light on the nature of the votes. For instance, if all students have the same increase of proficiency, the correlation coefficient between $\theta_1$ and $\theta_2$ is equal to 1. However, one can expect that, depending on various inhomogeneities, this increase of proficiency is not the same for all students, although fluctuations should remain small. Conversely, a low correlation coefficient between answers at the first and the second vote should warn us that something wrong could have happened. Maybe the students have voted randomly at one of the two votes (or both of them), or maybe the measurement errors are huge (see Appendix E), perhaps because the question is poorly written. As a consequence, a correlation coefficient close to 1 is a good sign (due to measurement errors, one cannot expect a coefficient greater than 0.85; see Appendix C), and a small correlation coefficient indicates that the results of the votes should be interpreted cautiously. Section IX shows that a value of 0.3 can be used as a rough threshold.
The traditional Pearson correlation coefficient cannot be used with students' answers because they are categorical variables (either true or false) and not continuous ones. The tetrachoric correlation coefficient has been especially developed to deal with categorical data explained by latent variables [31]. It is a product-moment correlation between two unobserved quantitative variables that have each been measured on a dichotomous scale. It assumes that the contingency table of the observed variables, here $X_1$ and $X_2$, comes from two correlated random variables that are normally distributed, here $Y_1$ and $Y_2$, where each $X_i$ is true when $Y_i$ exceeds a threshold and false otherwise. The thresholds and the correlation coefficient between $Y_1$ and $Y_2$ are estimated using the maximum likelihood (ML) technique [31].
For the statistical software R, the dedicated package polycor provides the function polychor, which estimates this correlation coefficient using the ML method. However, depending on the purpose, an approximate value of this correlation coefficient can be sufficient; it is obtained from the contingency table [32] through Eq. (7), where $N_1$, $N_2$, $N_3$, and $N_4$ are the components of the contingency table (cf. Table I). Their sum is equal to the total number of students: $N_1 + N_2 + N_3 + N_4 = N$.
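For readers without access to R, the sketch below estimates the tetrachoric correlation in plain Python. It is neither the approximation of Eq. (7) nor the polychor routine: it is a two-step procedure, assumed here for illustration, that fixes the two thresholds from the marginal proportions and then finds by bisection the correlation of a standard bivariate normal distribution that reproduces the observed proportion of students correct at both votes. For a 2×2 table this should coincide with the ML estimate when it exists, since the model has as many parameters as free cells. The argument names (`n00`, `n01`, `n10`, `n11`) are hypothetical labels and do not follow the layout of Table I.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def upper_orthant(t1, t2, rho, steps=400):
    """P(Y1 > t1, Y2 > t2) for a standard bivariate normal with correlation rho.

    Uses Simpson's rule on the integral over x > t1 of
    phi(x) * P(Y2 > t2 | Y1 = x); steps must be even.
    """
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        return _STD.pdf(x) * (1.0 - _STD.cdf((t2 - rho * x) / scale))

    lo, hi = t1, t1 + 10.0  # the normal density is negligible beyond this range
    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0


def tetrachoric(n00, n01, n10, n11, tol=1e-6):
    """Tetrachoric correlation from a 2x2 table of counts.

    n00: incorrect at both votes      n01: incorrect, then correct
    n10: correct, then incorrect      n11: correct at both votes
    """
    n = n00 + n01 + n10 + n11
    p1 = (n10 + n11) / n  # proportion correct at the first vote
    p2 = (n01 + n11) / n  # proportion correct at the second vote
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("both marginal proportions must lie strictly in (0, 1)")
    t1 = _STD.inv_cdf(1.0 - p1)  # threshold such that P(Y1 > t1) = p1
    t2 = _STD.inv_cdf(1.0 - p2)
    target = n11 / n
    lo, hi = -0.999, 0.999
    # The orthant probability increases with rho, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if upper_orthant(t1, t2, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When a cell of the table is empty, the estimate saturates near the bounds ±0.999 and should not be trusted.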

VIII. CONFIDENCE INTERVALS
The evaluation of $d$ and $\rho$ using Eqs. (6) and (7) is made from the observed proportions of correct answers and the corresponding contingency table. Hence they are estimators of the true values, which are based on the theoretical proportions obtained only for an infinite number of students. Suppose that they are calculated using a sample of 10 students. It is clear that the value obtained will be a poor indication of the real effect size of the pedagogical intervention. For research applications, such as determining which one of two methods leads to the greatest effect size or whether an intervention has a non-null effect size, confidence intervals are needed. A confidence interval is an interval that might contain the true value of the estimated parameter, i.e., the value that would have been obtained with an infinite number of students drawn from a theoretical distribution (in our case, the Gaussian distributions of students' proficiencies shown in Fig. 2). Confidence intervals are given with a confidence level. For instance, a 95% confidence interval has a 95% chance of containing the true value. This means that if the same experiment (first vote, pedagogical intervention, and second vote) was repeated on numerous samples of students, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter ($d$ or $\rho$) would tend toward 95%. The greater the confidence level is, the wider the confidence interval. A 95% confidence interval is included in the 99% confidence interval calculated from the same sample.
For a given number of students $N$, the observed proportions of correct answers $p_1^{\mathrm{obs}}$ and $p_2^{\mathrm{obs}}$ approximately follow normal distributions of means $p_i$ and variances $p_i(1 - p_i)/N$ (due to the normal approximation of the binomial distribution when $N \gg 1$). Assuming that the standard deviation of $p_i^{\mathrm{obs}}$ remains small compared with $p_i$, the logit function can be linearized around $p_i$. Hence the observed logit $L_i^{\mathrm{obs}}$ of $p_i^{\mathrm{obs}}$ approximately follows a normal distribution of mean $\mathrm{logit}(p_i)$ and variance
$$\sigma_{L_i}^2 = \frac{1}{N p_i (1 - p_i)}. \tag{8}$$
The observed effect size also approximately follows a normal distribution. Hence a 95% confidence interval of $d$ can be estimated by
$$d \in \left[ d^{\mathrm{obs}} - 1.96\,\sigma_d,\; d^{\mathrm{obs}} + 1.96\,\sigma_d \right], \tag{9}$$
where $\sigma_d$ is the standard deviation of $d$, estimated using the observed values $p_i^{\mathrm{obs}}$:
$$\sigma_d = 0.6 \sqrt{\sigma_{L_1}^2 + \sigma_{L_2}^2 - 2 \rho_L\, \sigma_{L_1} \sigma_{L_2}}. \tag{10}$$
The correlation coefficient $\rho_L$ between $L_1^{\mathrm{obs}}$ and $L_2^{\mathrm{obs}}$ is unknown. Numerical simulations were performed in order to estimate it. For a given set of values of the number of students $N$, the correlation coefficient $\rho$, a proportion $p_1$, and an effect size $d$ (or, equivalently, a proportion $p_2$), $N$ samples were drawn from the bivariate normal distribution $(\theta_1, \theta_2)$ assuming equal variances, a correlation coefficient $\rho$, and expected values given by Eq. (4); the question difficulty $b$ was arbitrarily set to 0. Then the observed students' answers $X_1$ and $X_2$ and the corresponding contingency table were calculated using Eq. (2) assuming a null measurement error. Those steps were repeated 10 000 times in order to calculate the correlation coefficient $\rho_L$. In order to cover a wide range of possible parameters, this process was repeated for $N = 200$ and 1600; $\rho = 0.3$, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9; $p_1 = 0.3$, 0.4, 0.5, 0.6, and 0.7; and $d = 0.2$, 0.5, and 1.2, leading to a total of 10 000 × 210 simulations. Results show that the correlation coefficient $\rho_L$ between $L_1^{\mathrm{obs}}$ and $L_2^{\mathrm{obs}}$ is very close to the correlation coefficient $\rho_p$ between $p_1^{\mathrm{obs}}$ and $p_2^{\mathrm{obs}}$: the absolute difference between these two correlation coefficients is less than 0.005. Moreover, the correlation coefficient $\rho_p$ is given by the traditional Pearson correlation coefficient between $X_1$ and $X_2$, assuming a value of 0 for an incorrect answer and 1 for a correct answer. Hence, $\rho_L$ is given by this Pearson coefficient, expressed in Eq. (11) in terms of the observed proportions and of $N_4$, a component of the contingency table (cf. Table I). When calculating the confidence interval with Eq. (9), $\sigma_{L_1}$, $\sigma_{L_2}$, and $\rho_L$ are evaluated using the observed values $p_1^{\mathrm{obs}}$ and $p_2^{\mathrm{obs}}$ in Eqs. (8) and (11). In order to validate Eq. (9), numerical simulations were performed in the same way as previously. For fixed values of $N$, $\rho$, $p_1$, and $d$, the number of times the true value of $d$ falls in the confidence interval was counted over 5000 simulations. This process was repeated varying the values of $\rho$, $p_1$, and $d$ as previously ($N$ was set to 200). On average, the true effect size $d$ falls in the confidence interval 94.9% of the time, validating the approach.
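The procedure described above can be sketched in Python as follows. This is a reconstruction under stated assumptions rather than the author's code: the variance of each observed logit is taken from the linearization, $1/[N p_i (1 - p_i)]$; the correlation between the two logits is replaced by the Pearson (phi) coefficient of the two dichotomous answers; and the inputs are hypothetical counts (`n_first`, `n_second`, `n_both`) rather than the $N_1, \ldots, N_4$ cells of Table I, whose layout is not reproduced here.

```python
import math


def effect_size_ci(n_first, n_second, n_both, n_total, z=1.96, scale=0.6):
    """Effect size and its confidence interval from paired dichotomous votes.

    n_first:  students correct at the first vote
    n_second: students correct at the second vote
    n_both:   students correct at both votes
    n_total:  students who voted twice
    Returns (d, lower, upper); z = 1.96 corresponds to a 95% level.
    """
    p1 = n_first / n_total
    p2 = n_second / n_total
    p12 = n_both / n_total
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("both proportions must lie strictly in (0, 1)")

    # Point estimate: scale times the log of the odds ratio.
    d = scale * (math.log(p2 / (1.0 - p2)) - math.log(p1 / (1.0 - p1)))

    # Standard deviations of the observed logits (linearized logit).
    s1 = 1.0 / math.sqrt(n_total * p1 * (1.0 - p1))
    s2 = 1.0 / math.sqrt(n_total * p2 * (1.0 - p2))

    # Pearson (phi) correlation between the two 0/1 answers.
    phi = (p12 - p1 * p2) / math.sqrt(p1 * (1.0 - p1) * p2 * (1.0 - p2))

    sd = scale * math.sqrt(s1 * s1 + s2 * s2 - 2.0 * phi * s1 * s2)
    return d, d - z * sd, d + z * sd


if __name__ == "__main__":
    # Hypothetical class of 200 students: 86 correct first, 148 second, 80 both.
    d, lower, upper = effect_size_ci(86, 148, 80, 200)
    print(f"d = {d:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Other confidence levels are obtained by changing `z`, for example 2.576 for a 99% interval.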
Equation (9) shows that the size of the confidence interval is proportional to $1/\sqrt{N}$. As an illustrative example, a population of $N = 200$ students was generated, with $p_1 = 40\%$, $p_2 = 70\%$, and $\rho = 0.7$. The observed proportions of correct answers were $p_1^{\mathrm{obs}} = 43\%$ and $p_2^{\mathrm{obs}} = 74\%$. The corresponding observed effect size was 0.81 with a 95% confidence interval of [0.61, 1.01]. The true effect size is 0.75, which falls in the confidence interval. Another population of 400 students was generated using the same values for $p_1$, $p_2$, and $\rho$. The observed scores were $p_1^{\mathrm{obs}} = 37\%$ and $p_2^{\mathrm{obs}} = 71\%$, leading to an observed effect size of 0.85 with a 95% confidence interval of [0.71, 1.00]. Once again, the true effect size falls in this (smaller) confidence interval.
Confidence intervals of $\rho$ using the maximum likelihood method are outputs of the R library. However, the role of $\rho$ is only to warn the instructor to conduct more investigations; it is not a guarantee of a good or bad change of scores. Hence confidence intervals are not necessarily needed, and the rough estimation given by Eq. (7) could be sufficient.
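The kind of simulation used for this illustration can be sketched in Python with the standard library only. The sketch assumes, as in the text, a difficulty $b = 0$, unit variances, a null measurement error, and mean proficiencies set from 0.6 times the logit of $p_1$; the confidence interval uses the linearized logit variance together with the Pearson correlation of the two answers. All function names, the seed, and the number of repetitions are illustrative choices.

```python
import math
import random

SCALE = 0.6  # factor between the log odds ratio and the effect size


def simulate_votes(n, p1, d, rho, rng):
    """Draw n students and return their (first, second) answers as booleans."""
    mean1 = SCALE * math.log(p1 / (1.0 - p1))  # mean proficiency before
    mean2 = mean1 + d                          # mean proficiency after
    votes = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta1 = mean1 + z1
        theta2 = mean2 + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        votes.append((theta1 > 0.0, theta2 > 0.0))  # correct if theta > b = 0
    return votes


def ci_from_votes(votes, z=1.96):
    """Observed effect size and its confidence interval from paired answers."""
    n = len(votes)
    p1 = sum(x1 for x1, _ in votes) / n
    p2 = sum(x2 for _, x2 in votes) / n
    p12 = sum(x1 and x2 for x1, x2 in votes) / n
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("degenerate sample: a proportion equals 0 or 1")
    d_obs = SCALE * (math.log(p2 / (1.0 - p2)) - math.log(p1 / (1.0 - p1)))
    s1 = 1.0 / math.sqrt(n * p1 * (1.0 - p1))
    s2 = 1.0 / math.sqrt(n * p2 * (1.0 - p2))
    phi = (p12 - p1 * p2) / math.sqrt(p1 * (1.0 - p1) * p2 * (1.0 - p2))
    sd = SCALE * math.sqrt(s1 * s1 + s2 * s2 - 2.0 * phi * s1 * s2)
    return d_obs, d_obs - z * sd, d_obs + z * sd


def coverage(n=200, p1=0.4, d=0.75, rho=0.7, reps=400, seed=0):
    """Fraction of simulated classes whose interval contains the true d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        _, lower, upper = ci_from_votes(simulate_votes(n, p1, d, rho, rng))
        hits += lower <= d <= upper
    return hits / reps


if __name__ == "__main__":
    print(f"empirical coverage: {coverage():.3f}")
```

With a few hundred repetitions the empirical coverage fluctuates by about one percentage point around its expected value, so many more repetitions are needed to check the nominal 95% level precisely.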

IX. APPLICATION TO PEER INSTRUCTION
Let us consider a practical application: an introductory mechanics course taught in a French École d'Ingénieurs from January 2016 to April 2016. This course was composed of ten lectures using Peer Instruction followed by tutorials in small groups. During lectures, all 190 students had a personal clicker. On average, 3.7 full Peer Instruction processes (first individual vote, peer discussion, and second individual vote) were performed during a lecture, leading to a database of 37 pre- and postdiscussion scores on the ConcepTests. Because the participation rate at each vote was around 80%, the average number of students answering a ConcepTest twice is 125 (with a standard deviation of 24).
Results of votes are plotted in Fig. 4. For each ConcepTest, the effect size is estimated from the pre- and postdiscussion scores using Eq. (6). The average effect size is 0.67, a value between the medium (0.5) and large (0.8) limits. One-quarter of the effect sizes are below 0.3 and one-quarter are above 0.75. Figure 4 shows that almost all discussions lead to an effect size between small and large, with three above the very large limit.
FIG. 4. Results of the peer discussion for the 37 ConcepTests. Each point is the proportion of correct students' answers to a given ConcepTest. Lines are isovalues of the Cohen's d effect size (no effect, small, medium, large, and very large effect).
FIG. 5. Effect size as a function of the correlation coefficient estimated using ML for the 37 ConcepTests. Horizontal lines correspond to values for small (0.2), medium (0.5), large (0.8), and very large (1.2) effect sizes. The vertical line corresponds to $\rho = 0.3$.
From students' answers to items, the polychoric correlation coefficients were calculated using both the maximum likelihood method and the approximate value given by Eq. (7). Both methods led to similar results (see Appendix F). The efficiencies of the discussion process for all items are plotted in a $d$-$\rho$ diagram in Fig. 5. The most efficient discussions are at the top of the diagram. One item led to a negative effect size and is not represented on this plot. As seen in the diagram, two items led to a poor correlation coefficient of around 0.2. Hence for these two items, there is a high probability that students voted randomly at one of the two votes (or both of them) or that students did not completely understand the questions. From these results, a threshold value of $\rho = 0.3$ is suggested in order to detect abnormal answers.

X. CONCLUSION
Assuming that the observed scores on a given item are explained by a latent distribution, the Cohen's d effect size was expressed in terms of the observed scores before and after the intervention [Eq. (6)]. One of the main advantages of this evaluation is that this effect size has good measurement properties: it can be used to compare different kinds of interventions, whether the scores at the first vote were the same or not, and to calculate the ratio of efficiency between two different interventions. Moreover, it is broadly used in education research and many other fields. While many assumptions were used to derive Eq. (6), I advocate that it should be used to quantify the change of the scores instead of other indicators, such as the RD, RR, OR, or the Hake's g, because (i) it is grounded in item response theory and probabilistic thinking, (ii) comparisons with other educational studies using the Cohen's d effect size can be performed, and (iii) using the Cohen's d effect size along with confidence intervals conforms to the recommended practice of "the new statistics" [33].
The precision of the effect size estimated from the observed data is assessed using confidence intervals. Equation (9) gives the 95% confidence interval, but other levels of confidence can be calculated by changing the 1.96 value to the corresponding one. The confidence interval is needed in order to demonstrate that a pedagogical intervention has a non-null effect. It is also required to rank two interventions: their confidence intervals should not overlap.
Moreover, the polychoric correlation coefficient was also suggested to detect abnormal behaviors when its value is below 0.3. This threshold value is, at the moment, only a rule of thumb roughly estimated from specific data. Consequently, it should be seen as a first step toward the definition of a more precise rule.
These two indicators are easy to calculate from students' answers and can be implemented in any software, leading to an automatic estimation of the effect of the pedagogical interventions.Results can be plotted in a d-ρ diagram to compare efficiency of different pedagogical interventions.
Finally, while the paper has been focused on dichotomous scores, the approach can be easily extended to partial scores (see Appendix G).

APPENDIX A: RATIONALE FOR CHOOSING THE 2PL MODEL
The 1PL model assumes that the discrimination power $a$ is set to an arbitrary fixed value, usually equal to 1. When analyzing a full test composed of multiple items, such as a concept inventory, the 1PL and 2PL models differ because the 1PL model assumes that all items have the same discrimination power (i.e., $a = 1$ for each item) while the 2PL model allows them to have different values. However, in our case, only one item is considered. Hence the 1PL model and the 2PL model are equivalent due to the invariance property [34].
The 3PL model extends the 2PL one by adding a guessing parameter. In the 2PL model, when the proficiency $\theta$ goes to minus infinity, the probability of giving a correct answer goes to zero, as stated in Eq. (1). In the 3PL model, this probability goes to a constant value (greater than zero but lower than 1) called the guessing parameter. While this model is attractive, its assumptions and interpretation have been criticized [35,36]. Moreover, assuming a 2PL model still allows us to detect the presence of random votes, as shown in Sec. VII. Another reason for not selecting the 3PL model is that it is not compatible with standard latent trait modeling: Eq. (2) does not hold. As a consequence, the tetrachoric correlation coefficient cannot be calculated. Finally, a good item should not lead to guessing behaviors, because all possible answers reflect common students' answers, so that each student votes for the answer they believe to be correct. Guessing behaviors are therefore expected to be infrequent. All these reasons led us to select the 2PL model.

APPENDIX B: RELATION BETWEEN THE OBSERVED PROPORTION OF CORRECT ANSWERS AND THE POPULATION PARAMETERS
The proportion of correct answers to a question depends on the distribution of the students' proficiencies. In latent trait modeling and IRT, it is often assumed that the latter is normally distributed. This assumption will be used in the following; i.e., $\theta \sim \mathcal{N}(\bar{\theta}, \sigma_\theta^2)$. The average proficiency $\bar{\theta}$ and the standard deviation $\sigma_\theta$ are unknown. However, they are related to the observed proportion of correct answers. A simple relationship between those variables is derived in this section.
Let us write $Y = \theta + \epsilon$, which represents the imperfect measurement of the proficiency $\theta$ with the error $\epsilon$. Following Eq. (2), the proportion of correct answers is given by the proportion of students that have a $Y$ greater than $b$:
$$p = \int_b^{+\infty} f_Y(y)\, dy, \tag{B1}$$
where $f_Y$ is the probability density function of $Y$. No exact mathematical expression can be found for the distribution $f_Y$ because $Y$ is the sum of two variables with different probability density functions: $\theta$, which is normally distributed, and $\epsilon$, which follows a logistic distribution. However, the logistic distribution is very close to the normal distribution [25]. Hence the two-parameter logistic model Eq. (1) can be replaced by the two-parameter normal ogive model [37,38], which assumes that the error term $\epsilon$ is normally distributed. As a consequence, $Y$ is also normally distributed: $Y \sim \mathcal{N}(\bar{\theta}, \sigma_Y^2)$. Hence, $p$ is given by
$$p = \Phi\left(\frac{\bar{\theta} - b}{\sigma_Y}\right), \tag{B2}$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. This function can be approximated using a logistic function [25]:
$$\Phi(x) \simeq \frac{1}{1 + e^{-1.7x}}. \tag{B3}$$
Inserting Eq. (B3) into Eq. (B2) leads to
$$p \simeq \frac{1}{1 + \exp\left[-1.7\,(\bar{\theta} - b)/\sigma_Y\right]}. \tag{B4}$$
Rearranging Eq. (B4) leads to
$$\frac{\bar{\theta} - b}{\sigma_Y} \simeq 0.6 \ln\frac{p}{1 - p}, \tag{B5}$$
where $0.6 \simeq 1/1.7$.
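The quality of the logistic approximation used in this appendix can be checked numerically. The short Python sketch below, an illustration that is not part of the original derivation, scans a grid of values and reports the largest absolute gap between the standard normal cumulative distribution function and the logistic function with scaling constant 1.7; the grid bounds and resolution are arbitrary choices.

```python
import math
from statistics import NormalDist


def logistic_approx(x, k=1.7):
    """Logistic approximation of the standard normal CDF."""
    return 1.0 / (1.0 + math.exp(-k * x))


def max_abs_error(k=1.7, lo=-6.0, hi=6.0, steps=12001):
    """Largest |Phi(x) - logistic(k x)| over an evenly spaced grid."""
    std = NormalDist()
    width = (hi - lo) / (steps - 1)
    return max(
        abs(std.cdf(lo + i * width) - logistic_approx(lo + i * width, k))
        for i in range(steps)
    )


if __name__ == "__main__":
    print(f"max error with k = 1.7: {max_abs_error():.4f}")
    print(f"1 / 1.7 = {1 / 1.7:.3f}  (rounded to 0.6 in the text)")
```

The gap stays below one percentage point over the whole range, which is why a proportion of correct answers can be converted into a standardized mean proficiency with a simple logit.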

APPENDIX C: ESTIMATION OF THE MEASUREMENT ERROR
In order to estimate the measurement error in Eq. (2), we use the data from Lasry et al. [26], who administered the FCI twice in a row to 100 students. They reported the average contingency table for the 30 items (cf. Table II). From this contingency table, the tetrachoric correlation coefficient was estimated to be 0.85 (95% CI = [0.82, 0.86]). Because the proficiency is unchanged between the two administrations, this coefficient is equal to

$$\rho = \frac{\sigma_\theta^2}{\sigma_Y^2},$$

leading to $\sigma_Y/\sigma_\theta \simeq 1.09$.
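The conversion from the test-retest correlation to the ratio of standard deviations is a one-line calculation. The sketch below assumes the relation $\rho = \sigma_\theta^2/\sigma_Y^2$, which holds when the proficiency does not change between the two administrations and the two measurement errors are independent; the function name is hypothetical.

```python
import math


def sd_ratio_from_retest_correlation(rho: float) -> float:
    """Return sigma_Y / sigma_theta, assuming rho = sigma_theta**2 / sigma_Y**2."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    return 1.0 / math.sqrt(rho)


# Confidence bounds and point estimate of the tetrachoric correlation
for rho in (0.82, 0.85, 0.86):
    ratio = sd_ratio_from_retest_correlation(rho)
    print(f"rho = {rho:.2f} -> sigma_Y / sigma_theta = {ratio:.3f}")
```

Over the reported confidence interval, the ratio stays between about 1.08 and 1.10, so the measurement error inflates the standard deviation by roughly 10% at most.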

APPENDIX D: ESTIMATION OF THE CHANGE OF VARIANCE
Estimating $\sigma_1$ and $\sigma_2$ from the contingency table alone is not possible because any values of those variances could lead to the same contingency table. Hence, in order to estimate the ratio $\sigma_2/\sigma_1$, other data are needed. Using the FCI, we estimated the proficiency, as measured by the FCI, of two groups of first-year students, once at the beginning of a mechanics course and once at the end of the course, i.e., the end of the semester. Both courses used Peer Instruction during lectures.
The first group was composed of 210 students, and their average pretest score was 12.4 (SD = 5.8). At the end of the semester, the average score was 15.7 (SD = 6). Using IRT, we estimated the distributions of θ before and after the course and found $\bar{\theta}_{\text{pre}} = -0.72$ and $\sigma_{\text{pre}} = 1.15$, and $\bar{\theta}_{\text{post}} = 0$ and $\sigma_{\text{post}} = 1$. The ratio between the two standard deviations is 0.87.
The second group was composed of 183 students. FCI average scores were 10.1 (SD = 3.9) at the pretest and 13.2 (SD = 4.4) at the post-test. Proficiencies were estimated to be $\bar{\theta}_{\text{pre}} = -1.18$ and $\sigma_{\text{pre}} = 0.85$, and $\bar{\theta}_{\text{post}} = -0.42$ and $\sigma_{\text{post}} = 0.75$. The ratio between the two standard deviations is 0.88.
In both cases, the ratio of the standard deviations remains close to 1 after one semester of teaching. This is an indication that this ratio could be even closer to 1 when looking only at the effect of a single small intervention, such as a discussion with peers.
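For reference, the two ratios quoted above follow directly from the estimated standard deviations. The short Python sketch below only reproduces this arithmetic; the dictionary layout and labels are illustrative choices.

```python
# Estimated standard deviations of proficiency for each group: (pre, post)
groups = {
    "group 1 (210 students)": (1.15, 1.00),
    "group 2 (183 students)": (0.85, 0.75),
}

# Ratio sigma_post / sigma_pre for each group
ratios = {name: post / pre for name, (pre, post) in groups.items()}
for name, ratio in ratios.items():
    print(f"{name}: sigma_post / sigma_pre = {ratio:.2f}")
```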

APPENDIX E: CORRELATION COEFFICIENTS
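Assuming that the measurement errors are independent of each other and of the proficiencies, the correlation coefficient between $Y_1$ and $Y_2$ is given by

$$\rho = \rho_\theta \, \frac{\sigma_{\theta_1} \sigma_{\theta_2}}{\sigma_{Y_1} \sigma_{Y_2}},$$

where $\rho_\theta$ is the correlation coefficient between $\theta_1$ and $\theta_2$.
The correlation coefficient ρ is low if the correlation between $\theta_1$ and $\theta_2$ is low or if the measurement errors are large (i.e., $\sigma_\epsilon \gg \sigma_\theta$).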
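The way measurement error lowers the observed correlation can be illustrated numerically. The sketch below assumes that the errors are independent of each other and of the proficiencies, so that the variance of each Y is the sum of the proficiency and error variances; the function name and the numerical values are hypothetical.

```python
import math


def observed_correlation(rho_theta, sd_theta1, sd_theta2, sd_eps1, sd_eps2):
    """Correlation between Y1 = theta1 + eps1 and Y2 = theta2 + eps2.

    Assumes errors are independent of each other and of the proficiencies.
    """
    sd_y1 = math.hypot(sd_theta1, sd_eps1)  # sqrt(sd_theta1**2 + sd_eps1**2)
    sd_y2 = math.hypot(sd_theta2, sd_eps2)
    return rho_theta * sd_theta1 * sd_theta2 / (sd_y1 * sd_y2)


# Hypothetical case: perfectly stable proficiencies, increasing error
for sd_eps in (0.0, 0.5, 1.0, 3.0):
    rho = observed_correlation(1.0, 1.0, 1.0, sd_eps, sd_eps)
    print(f"sigma_eps = {sd_eps:.1f} -> rho = {rho:.2f}")
```

Even with perfectly correlated proficiencies, the observed correlation drops quickly once the error standard deviation exceeds that of the proficiency.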

APPENDIX F: VALIDITY OF THE APPROXIMATE ESTIMATION OF THE POLYCHORIC CORRELATION COEFFICIENT
The polychoric correlation coefficient was estimated using both the ML technique and Eq. (7) for the 37 items, and the results are plotted in Fig. 6. Both techniques lead to similar values. Equation (7) slightly overestimates the value given by the ML method, especially when the effect size is greater than 1. The d-ρ diagram using Eq. (7) is plotted in Fig. 7. The threshold value of ρ = 0.3 can still be used to define correlation coefficients that are too low.

APPENDIX G: TAKING INTO ACCOUNT POLYTOMOUS RESPONSES
In the previous sections, only true or false answers were considered, i.e., dichotomous data. This section shows how to evaluate an effect size when multiple scores can be obtained on an item. This is the case, for instance, when taking into account partially correct responses or scoring rubrics, as shown in Table III, or even Likert scales.
A first approach could be to convert the answers to dichotomous scores, for instance, by setting the answer to true only if the highest score was obtained.While this could be a quick solution to the problem, more advanced methods can be used in order to obtain a more precise evaluation of the effect size.The graded response model (GRM) [39] and the partial credit model (PCM) [40] were designed to take into account graded response data.Graded response data consist of a score that is an ordinal number, typically ranging from 0 to M, where higher scores represent better performance on the item.Both the GRM and the PCM rely on logistic 2PL models and they are relatively similar.
The GRM models the probability of obtaining a score equal to or greater than a given value using a 2PL function:

$$P^*_k(\theta) = P(S \geq k \mid \theta) = \frac{1}{1 + e^{-a(\theta - b_k)}},$$

where S is the score obtained and k ranges from 1 to M. The probability of obtaining a score equal to k is given by

$$P_k(\theta) = P^*_k(\theta) - P^*_{k+1}(\theta).$$

The probability of obtaining a score greater than or equal to 0 is 1, and the probability of obtaining a score greater than or equal to M + 1 is 0. Hence, there are M different $P^*_k$ functions and M + 1 unknown parameters: $a, b_1, \ldots, b_M$. As all the $P^*_k$ are modeled using a 2PL model, Eq. (6) can be used to estimate the associated effect sizes, leading to M estimates of the true effect size. Finally, the average value of these M effect sizes can be used as a final estimate of the true effect size. This is illustrated in Table IV with hypothetical scores obtained by some students. Scores lie between 0 and 3, and the corresponding fractions of students who obtained those scores are reported in columns 2 and 3. The 3 corresponding effect sizes are calculated using the values of $P^*_k$. A similar process can be used if a PCM is used instead of a GRM, with $P^*_k = P(S = k)/[P(S = k - 1) + P(S = k)]$.
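The GRM procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the computation behind Table IV: the effect size formula is taken to be 0.6 times the log of the odds ratio, as summarized in the abstract, the function names are hypothetical, and the score fractions are invented for the example.

```python
import math


def cumulative_fractions(fractions):
    """Fractions of students with a score >= k, for k = 1..M."""
    return [sum(fractions[k:]) for k in range(1, len(fractions))]


def effect_size_from_proportions(p_before, p_after):
    """Assumed form of Eq. (6): 0.6 times the log of the odds ratio."""
    odds_before = p_before / (1.0 - p_before)
    odds_after = p_after / (1.0 - p_after)
    return 0.6 * math.log(odds_after / odds_before)


def grm_effect_size(fractions_before, fractions_after):
    """Return the M effect sizes (one per threshold k) and their average."""
    pairs = zip(cumulative_fractions(fractions_before),
                cumulative_fractions(fractions_after))
    sizes = [effect_size_from_proportions(p1, p2) for p1, p2 in pairs]
    return sizes, sum(sizes) / len(sizes)


# Hypothetical fractions of students with scores 0, 1, 2, 3 (not Table IV)
before = [0.30, 0.30, 0.25, 0.15]
after = [0.15, 0.25, 0.30, 0.30]
sizes, mean_size = grm_effect_size(before, after)
print([round(d, 2) for d in sizes], round(mean_size, 2))
```

Each threshold k turns the graded item into a dichotomous one (score at least k or not), so the single-question estimate applies unchanged to each of the M cumulative proportions before averaging.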
Concerning the correlation coefficient between $\theta_1$ and $\theta_2$, the polychoric correlation coefficient generalizes the tetrachoric one to take into account multiple categorical variables [31]. Hence it can be used directly with such data using an R or Python library. However, the assumptions behind this correlation coefficient agree only with the GRM and not with the PCM. Hence, for the purpose of this study, we recommend using the GRM, preferentially, to calculate the effect size.

FIG. 6. Values of the correlation coefficient using Eq. (7) as a function of its value using the maximum likelihood method.

FIG. 7. Effect size as a function of the correlation coefficient estimated using Eq. (7) for the 37 ConcepTests. Horizontal lines correspond to values for small (0.2), medium (0.5), large (0.8), and very large (1.2) effect sizes. The vertical line corresponds to ρ = 0.3.

FIG. 2. Distributions of students' proficiencies. Top, before the intervention; bottom, after the intervention. The vertical dashed line is the question difficulty. Proportions of correct answers correspond to gray areas.

TABLE I. Contingency table.

TABLE II. Average contingency table of the test-retest for the 30 items of the FCI (data from Lasry et al. [26]).

TABLE III. Example of a scoring rubric for an item.

TABLE IV. Example of the calculation of the effect sizes for partial scores between 0 and 3.