Item pre-knowledge true prevalence in clinical anatomy - application of gated item response theory model

Background: Nowadays, computer-based and paper-based examinations are constructed from an item pool that is regularly updated. Given the way that exams are created, one of the major concerns is the security of the items used, which is needed to ensure a good estimation of abilities. The aim of this study is to measure the prevalence of item pre-knowledge in our medical school.
Methods: The Deterministic, Gated Item Response Theory Model (DGM) was applied to estimate the prevalence of students who had item pre-knowledge in six multiple choice examinations of the Clinical Anatomy course at the Faculty of Medicine of University of Porto. Each examination consisted of 100 items, with an average of 200 students and 20% repeated items per examination. The sensitivity and specificity were estimated in a simulation study. The sensitivity and specificity estimates and the apparent prevalence were then used to estimate the true prevalence of cheating students in the examinations under study.
Results: The specificity of the DGM across the simulation scenarios was between 68 and 98%, while the sensitivity ranged from 60 to 91%. The apparent prevalence was between 0.0 and 3.4%, while the true prevalence ranged from 1.2 to 3.7%.
Conclusions: The true prevalence was much lower than the self-reported prevalence of students copying responses from other students; however, it is important to keep monitoring the pre-knowledge prevalence in order to enforce measures in case an increase occurs.
Electronic supplementary material: The online version of this article (10.1186/s12909-019-1710-z) contains supplementary material, which is available to authorized users.


Background
Over the course of medical school, it is very common to assess acquired skills through multiple choice examinations [1]. An exam is constructed from an item pool that is regularly updated. Given the way that exams are created, one of the major concerns is the security of the items used, which is needed to ensure a good estimation of abilities. In some situations, students have item pre-knowledge, either through over-exposure of items or through illicit access to them, and their item responses deviate from the underlying item response theory (IRT) model, inflating their test scores [2].
Illicit access to items would be considered academic cheating. Academic cheating is defined as unethical or unauthorized academic activity, and is usually related to examinations [3]. A coordinated and purposeful exposure of items is very worrisome: it would inflate examination scores for students who have gained examination pre-knowledge, while penalizing honest students and consequently threatening the examination's validity [4]. Additionally, item responses may depart from the underlying IRT model [5].
Modeling potential behavior for students possessing prior item knowledge is further compounded by the issue of whether this knowledge is actually used to gain some advantages on the examination [4]. That is, modeling the impact of prior item knowledge is difficult because we need to identify disclosed items and we cannot disregard students who may have access to this information [6].
Several studies have shown that innocuous repetition of a small set of items within a larger examination has little impact on performance [2,7,8]. For example, in a national USA certification test in radiography, either the same test form or a different test form was assigned to individuals who repeated the examination, and indistinguishable score gains were found between the two groups [9]; a similar result was observed for the Medical Council of Canada Evaluating Examination [10]. Normally, the test-maker can control the proportion of reused items when assembling the test; however, due to lack of time or economic pressure, this is not always done.
The self-reported prevalence of item pre-knowledge was about 25%, while the self-reported prevalence of copying answers during an examination at least once during medical school has ranged from 52% [11] to 67% [12].
The most common way to detect copying of answers or item pre-knowledge is to use Classical Test Theory (CTT) or Rasch IRT modeling to identify misfitting response patterns. These misfitting response patterns, especially among candidates with lower overall ability on the examination, are not conclusive evidence of "cheating" per se, but they suggest that irregular behavior might have been used to achieve correct responses on difficult questions (something we would not expect from low-ability candidates). In this context, several classical statistics [3][13][14][15] and software tools [16] have been developed to detect cheating on multiple-choice examinations.
Furthermore, several item pre-knowledge detection statistics have been recently developed [17,18] and those that showed better efficiency were the posterior shift and the Shu Deterministic, Gated Item Response Theory Model (DGM) [19].
In 2013, Shu et al. proposed the DGM, which classifies students as cheaters or non-cheaters according to their score gain on the exposed items (e.g., items repeated from previous examinations) compared to the non-exposed items (e.g., new items) [18].
The proposed DGM consists of a mixture of two one-parameter logistic (1-PL) models [20][21][22], which classifies students into two groups, cheaters and non-cheaters, by conditioning on two types of items: the first type includes the items that are probably exposed, and the second type the non-exposed items. The DGM allows item pre-knowledge to be detected by analyzing the difference between a student's ability on the items with possible pre-knowledge and his or her true ability.
Although previous studies have measured the apparent prevalence (AP) (percentage of students classified as having item pre-knowledge), no studies have measured the true prevalence (TP) (percentage of students who truly have item pre-knowledge), as they did not take into account the sensitivity (SEN) and specificity (SPE) of the detection method.
If the prevalence of item pre-knowledge is high, the design of the Clinical Anatomy examinations will need to be restructured.
The aim of this study was to estimate the item pre-knowledge true prevalence among medical students in the course of Clinical Anatomy at the Faculty of Medicine of University of Porto (FMUP) through the application of the DGM.

Methods
All multiple choice examinations from the Clinical Anatomy course between 2008 and 2011 were analyzed to estimate the prevalence of students who had item pre-knowledge.
In each year, there were two final examinations which comprised a total of eight examinations. Each examination consisted of 100 standard multiple choice questions (MCQ) (five response options where only one was the correct answer), for a total of 800 items.
Each of the 100 items in each examination was compared with all other examination items in order to verify whether the item had been reused. The year 2008 was considered as the starting year and was excluded from the analysis because it did not contain any reused items. The items classified as reused were treated as exposed items, since students may have memorized items from a previously provided examination. The items used for the first time in the examination were treated as non-exposed items.
Initially, the data description was carried out using CTT in order to better comprehend the items' characteristics; 1-PL and Two-parameter Logistic (2-PL) IRT models [20][21][22] were applied in order to validate the 1-PL model choice used in the DGM.
The 1-PL and 2-PL models were estimated using the marginal maximum likelihood estimation and the Expectation-Maximization (EM) algorithm [23,24]. The chosen 1-PL model was the logistic model in which the discrimination parameter was estimated to be identical in all items.
In this study, the selected model was defined according to the Akaike Information Criterion (AIC) [25,26], the Bayesian Information Criterion (BIC) [27,28] and Convex Hull (CHull) method [29]. The model that better fits the data has the lowest AIC and BIC values, and the highest CHull value.
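As an illustration of the information criteria used for model selection, the following minimal sketch computes AIC and BIC from a fitted model's log-likelihood. The parameter counts assume a 100-item test, with the 1-PL model having 100 difficulties plus one common discrimination and the 2-PL model having 100 difficulties plus 100 discriminations; the log-likelihood values are made up for the example and are not results from this study. The CHull method is not covered here.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: penalizes parameters by log(sample size)."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical log-likelihoods for one examination with 200 students.
loglik_1pl, k_1pl = -11250.0, 101   # 100 difficulties + 1 common discrimination
loglik_2pl, k_2pl = -11170.0, 200   # 100 difficulties + 100 discriminations
n_students = 200

for name, ll, k in [("1-PL", loglik_1pl, k_1pl), ("2-PL", loglik_2pl, k_2pl)]:
    print(name, "AIC =", aic(ll, k), "BIC =", round(bic(ll, k, n_students), 1))
```

Because BIC multiplies the number of parameters by log(n) rather than by 2, it penalizes the extra discrimination parameters of the 2-PL model more heavily than AIC whenever n exceeds about 7, which is one reason the two criteria can disagree.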
The difficulty index (percentage of students who correctly answered the item) and the discrimination index (biserial correlation between the item and the number of correct answers to the other items) of the examination items were described using the mean and standard deviation (SD).
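The two CTT indices can be sketched as follows for a 0/1 response matrix with students in rows and items in columns. Note one substitution: for simplicity this sketch uses the point-biserial (Pearson) correlation between each item and the rest score, not the biserial correlation used in the study, so the values will generally be somewhat smaller. The function name `item_statistics` is hypothetical.

```python
import numpy as np

def item_statistics(Y):
    """Return (difficulty, discrimination) per item for a 0/1 response matrix.

    difficulty: proportion of students answering the item correctly.
    discrimination: point-biserial correlation between the item and the
    number of correct answers on all OTHER items (rest score).
    """
    Y = np.asarray(Y, dtype=float)
    difficulty = Y.mean(axis=0)
    total = Y.sum(axis=1)
    discrimination = np.full(Y.shape[1], np.nan)
    for i in range(Y.shape[1]):
        rest = total - Y[:, i]          # exclude the item itself
        if Y[:, i].std() > 0 and rest.std() > 0:
            discrimination[i] = np.corrcoef(Y[:, i], rest)[0, 1]
    return difficulty, discrimination

# Toy example: 4 students, 3 items.
Y = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 0, 0]]
difficulty, discrimination = item_statistics(Y)
print(difficulty)
```

Using the rest score instead of the total score avoids inflating the correlation by including the item in its own criterion.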
To assess whether there were significant differences in the difficulty and discrimination indexes between examinations or by number of repetitions, mixed effect models were used with a fixed factor (examination or number of repetitions) and an item-level random intercept to account for the residual correlation within the same (reused) items.
Data were aggregated by item in order to eliminate the residual correlation within students who repeated the examinations; otherwise, the previous model would have needed to include a student-level random intercept. The main reasons for aggregating the data were data sparsity, due to the small number of reused items and the small number of students who repeated the examinations; furthermore, item-level characteristics (e.g., the number of repetitions of the items) were the features of interest in this study.
The recommendations for the interpretation of the difficulty index suggest that values between 0 and 30% indicate a difficult item; values ranging from 31 to 80% imply an item of medium difficulty; and values between 81 and 100% can be labeled as an easy question [30]. The recommendations for the interpretation of the discrimination index suggest five categories: values between −1.00 and −0.19 indicate negative discrimination; values ranging from −0.20 to 0.19, weak discrimination; values between 0.20 and 0.29, sufficient discrimination; values from 0.30 to 0.39, good discrimination; and values between 0.40 and 1.00, very good discrimination [31].
Cronbach's alpha was used to assess the examination reliability. Recommendations suggest that examinations with 50 or more items have good reliability if Cronbach's alpha is equal to or greater than 0.8 [32]. McDonald's coefficients $\omega_h$ and $\omega_t$ [33] were also used to evaluate the reliability of the examinations (general factor saturation and internal consistency, respectively).
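For reference, Cronbach's alpha for a test with $k$ items is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_{total}^2}\right)$, where $s_i^2$ is the variance of item $i$ and $s_{total}^2$ is the variance of the total score. The following minimal sketch implements this formula for a 0/1 response matrix; the function name is hypothetical, and McDonald's $\omega$ coefficients (which require a factor model) are not covered.

```python
import numpy as np

def cronbach_alpha(Y):
    """Cronbach's alpha for a students-by-items score matrix."""
    Y = np.asarray(Y, dtype=float)
    k = Y.shape[1]
    sum_item_var = Y.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = Y.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1.0 - sum_item_var / total_var)

# Toy example: three perfectly consistent items give alpha = 1.
Y = [[1, 1, 1],
     [0, 0, 0],
     [1, 1, 1],
     [0, 0, 0]]
print(cronbach_alpha(Y))
```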

DGM
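As mentioned previously, the DGM is composed of a mixture of two 1-PL models, which allows students to be classified into two groups. This classification takes into account the students' results on the secure and exposed items. Thus, the DGM uses, on the one hand, the true ability, $\theta_{tm}$, to characterize the real skill of the $m$th student, $m = 1, \dots, M$, and, on the other hand, his/her cheating ability, $\theta_{cm}$, to estimate cheating efficiency.
Therefore, the DGM classifies each student as having item pre-knowledge (cheater) or not having item pre-knowledge (non-cheater) according to a specific threshold value.
Each item of the test is classified as either compromised or secure according to whether or not it is a reused item. Thus, for each item $i$, the item exposure status, $G_i$, is dichotomously defined as

$$G_i = \begin{cases} 1, & \text{if item } i \text{ is exposed} \\ 0, & \text{otherwise.} \end{cases}$$

Assuming that the true and cheating abilities are known, a student can be classified as a cheater if his/her true ability is lower than his/her cheating ability. Therefore, for each student we consider the dichotomous indicator variable $T_m$, where $T_m = 1$ represents that the $m$th examinee is a cheater:

$$T_m = \begin{cases} 1, & \text{if } \theta_{tm} < \theta_{cm} \\ 0, & \text{otherwise.} \end{cases}$$

The goal of conditioning on the two item types is to use the information provided by the secure items to infer the level of item compromise contained in the exposed items. Let $Y_{mi}$ denote the response of student $m$ to item $i$ (1 if correct, 0 otherwise). The probability that the $m$th examinee answers the $i$th item correctly is

$$P(Y_{mi} = 1 \mid \theta_{tm}, \theta_{cm}, b_i, T_m, G_i) = P(\theta_{tm})^{1 - G_i} \left[ (1 - T_m)\, P(\theta_{tm}) + T_m\, P(\theta_{cm}) \right]^{G_i}, \qquad P(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)},$$

where $b_i$ represents the item difficulty. Both $G_i$ and $T_m$ are dichotomously defined; therefore, the DGM can be further broken down into four conditional models:

$$\begin{aligned}
P(Y_{mi} = 1 \mid T_m = 0, G_i = 0) &= \frac{\exp(\theta_{tm} - b_i)}{1 + \exp(\theta_{tm} - b_i)}, \\
P(Y_{mi} = 1 \mid T_m = 0, G_i = 1) &= \frac{\exp(\theta_{tm} - b_i)}{1 + \exp(\theta_{tm} - b_i)}, \\
P(Y_{mi} = 1 \mid T_m = 1, G_i = 0) &= \frac{\exp(\theta_{tm} - b_i)}{1 + \exp(\theta_{tm} - b_i)}, \\
P(Y_{mi} = 1 \mid T_m = 1, G_i = 1) &= \frac{\exp(\theta_{cm} - b_i)}{1 + \exp(\theta_{cm} - b_i)}.
\end{aligned}$$

When the student is classified as a non-cheater, $T_m = 0$, the responses to all items are based only on his/her true ability, $\theta_{tm}$, and therefore do not depend on $\theta_{cm}$. However, when $T_m = 1$, that is, for students who are cheaters, it is necessary to take into account whether the items are exposed or not. Student answers to the unexposed items ($G_i = 0$) are based on their true ability ($\theta_{tm}$), while responses to the exposed items ($G_i = 1$) are based on their cheating ability ($\theta_{cm}$). Accordingly, cheating ability only influences the response probability of cheating students on the exposed items.
Taking into consideration the $G_i$ and $T_m$ values, the probability of the $m$th student correctly answering item $i$ can be written as a single expression that emphasizes the mixture structure of the model used:

$$P(Y_{mi} = 1) = (1 - T_m G_i)\, \frac{\exp(\theta_{tm} - b_i)}{1 + \exp(\theta_{tm} - b_i)} + T_m G_i\, \frac{\exp(\theta_{cm} - b_i)}{1 + \exp(\theta_{cm} - b_i)}.$$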
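The gating logic described above can be illustrated with a short sketch. It assumes that the response probability takes the mixture form $(1 - T_m G_i)\,P(\theta_{tm}) + T_m G_i\,P(\theta_{cm})$ with a 1-PL logistic $P(\theta) = 1 / (1 + e^{-(\theta - b_i)})$, which is consistent with the four conditional cases in the text; the function names are hypothetical.

```python
import math

def logistic_1pl(theta, b):
    """1-PL probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def dgm_prob(theta_t, theta_c, b, T, G):
    """Response probability under the gated model.

    T: 1 if the student is a cheater, 0 otherwise.
    G: 1 if the item is exposed, 0 otherwise.
    The cheating ability theta_c is used only when T = 1 and G = 1.
    """
    gate = T * G
    return (1 - gate) * logistic_1pl(theta_t, b) + gate * logistic_1pl(theta_c, b)

# A student with true ability 0 and cheating ability 2 on an item of difficulty 0.
for T in (0, 1):
    for G in (0, 1):
        print(f"T={T}, G={G}: {dgm_prob(0.0, 2.0, 0.0, T, G):.3f}")
```

Only the combination of a cheating student and an exposed item raises the probability above the value implied by the true ability.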
In order to decide whether a student is classified as a cheater or a non-cheater, it is necessary to fix a cut-off point. This threshold is applied to the probability, estimated by the DGM, of a student being a cheater ($T = 1$), denoted $P_c$ ($0 < P_c < 1$). Shu et al. [18] used a fixed value of 90% as the cut-off point for $P_c$, while in the present study we also used a classification tree to identify the best cut-off value of $P_c$ for classifying students as having or not having item pre-knowledge. Classification trees are a statistical method used to construct binary trees by successive divisions of the data according to a rule that divides the data into groups that are as uniform as possible [34]. Homogeneity within the two resulting subgroups is defined by impurity, a measure that takes the value zero in completely homogeneous subgroups. In classification trees (where the response variable is qualitative), impurity can be measured by the entropy, which must be minimized since it measures heterogeneity within groups. Thus, among all possible cut-off values, the one chosen was the one that minimized entropy.
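The entropy criterion for choosing a single cut-off can be sketched as below. This is not the rpart implementation used in the study; it is a minimal stand-alone version of a one-split classification tree, assuming that the true cheating status is known (as it is in simulated data) and that candidate cut-offs are the midpoints between consecutive distinct posterior probabilities. All function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 0/1 label array; 0 for empty or pure groups."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def best_cutoff(post_prob, is_cheater):
    """Return (cutoff, weighted_entropy) minimizing impurity of the two subgroups."""
    post_prob = np.asarray(post_prob, dtype=float)
    is_cheater = np.asarray(is_cheater, dtype=int)
    values = np.unique(post_prob)                    # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2.0    # midpoints between neighbors
    best_c, best_h = None, np.inf
    for c in candidates:
        below = is_cheater[post_prob < c]
        above = is_cheater[post_prob >= c]
        h = (below.size * entropy(below) + above.size * entropy(above)) / is_cheater.size
        if h < best_h:
            best_c, best_h = float(c), h
    return best_c, best_h

# Toy example: posterior probabilities of being a cheater and the simulated truth.
probs = [0.10, 0.20, 0.30, 0.80, 0.90, 0.95]
truth = [0, 0, 0, 1, 1, 1]
print(best_cutoff(probs, truth))
```

Students whose posterior probability is at or above the returned cut-off would then be classified as cheaters.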

Simulation study
This subsection aims to describe the conditions of the simulation study that supported the analysis of sensitivity and specificity of the DGM as well as the best choice of the cut-off point that distinguishes cheaters from non-cheaters.
The simulation study was carried out under conditions as close as possible to those observed in the Clinical Anatomy course examinations. In the real data, there were on average 20 reused items and 200 students per examination, and those values were used in the simulation study. The simulation study must take into account the characteristics of item pre-knowledge, including the proportion of item pre-knowledge and the effectiveness of item pre-knowledge. The proportion of item pre-knowledge refers to the percentage of students who have pre-knowledge of the exposed items. The effectiveness of item pre-knowledge is the effective score gain resulting from prior knowledge of the exposed items. According to the score gain level, the most effective students (high-effective) obtain the highest effective gain and the less effective students (low-effective) obtain a lower effective gain. We considered four scenarios with four levels of proportion of item pre-knowledge, 5, 10, 35 and 70%, and two levels of effectiveness of item pre-knowledge, high-effective and low-effective. For each of the scenarios, we simulated 100 replicates.
The item difficulty ($b$) was simulated according to a standard normal distribution. The student's true ability ($\theta_t$) was also simulated according to the standard normal distribution, $N(0, 1)$, and the student's cheating ability ($\theta_c$) was obtained by adding the effective score gain ($\Delta$) to the true ability. For a non-cheating student, the effective gain is zero, while for a cheating student it is simulated from a beta distribution. When the cheating category is high-effective, the score gain is simulated according to $3 \times \mathrm{Beta}(9, 4)$, and when it is low-effective, according to $3 \times \mathrm{Beta}(5, 5)$.
Thus, we can summarize the distributions used in the simulation of the item and student parameters of the DGM as

$$b_i \sim N(0, 1), \qquad \theta_{tm} \sim N(0, 1), \qquad \theta_{cm} = \theta_{tm} + \Delta_m,$$

with $\Delta_m = 0$ for a non-cheater, $\Delta_m \sim 3 \times \mathrm{Beta}(9, 4)$ for a high-effective cheater and $\Delta_m \sim 3 \times \mathrm{Beta}(5, 5)$ for a low-effective cheater.
Let $Y_{mi}$, $m = 1, \dots, 200$, $i = 1, \dots, 100$, be the response of student $m$ to item $i$. The $Y_{mi}$ were generated using

$$P(Y_{mi} = 1) = \frac{\exp(\theta_{cm} - b_i)}{1 + \exp(\theta_{cm} - b_i)}$$

for the exposed items and cheaters, and

$$P(Y_{mi} = 1) = \frac{\exp(\theta_{tm} - b_i)}{1 + \exp(\theta_{tm} - b_i)}$$

for all other cases.
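The data-generating process described above can be sketched in a few lines. This is an illustrative reimplementation, not the simulation code used in the study: it assumes that the number of cheaters is fixed at the rounded target proportion, that the first `n_exposed` items are the exposed ones, and that responses are Bernoulli draws from the 1-PL probabilities; the function name and arguments are hypothetical.

```python
import numpy as np

def simulate_dgm(n_students=200, n_items=100, n_exposed=20,
                 prop_cheaters=0.10, high_effective=True, seed=0):
    """Simulate one replicate of 0/1 responses under the gated model."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(n_items)              # item difficulties ~ N(0, 1)
    theta_t = rng.standard_normal(n_students)     # true abilities ~ N(0, 1)

    # Cheater indicator T: a fixed number of randomly chosen students (assumption).
    n_cheaters = int(round(prop_cheaters * n_students))
    T = np.zeros(n_students, dtype=int)
    T[rng.choice(n_students, size=n_cheaters, replace=False)] = 1

    # Score gain: 3 * Beta(9, 4) if high-effective, 3 * Beta(5, 5) if low-effective.
    shape_a, shape_b = (9, 4) if high_effective else (5, 5)
    delta = np.where(T == 1, 3.0 * rng.beta(shape_a, shape_b, size=n_students), 0.0)
    theta_c = theta_t + delta                     # cheating ability

    # Exposure indicator G: the first n_exposed items are exposed (assumption).
    G = np.zeros(n_items, dtype=int)
    G[:n_exposed] = 1

    # Cheating ability applies only to cheaters on exposed items.
    gate = np.outer(T, G)
    theta = np.where(gate == 1, theta_c[:, None], theta_t[:, None])
    p = 1.0 / (1.0 + np.exp(-(theta - b[None, :])))
    Y = rng.binomial(1, p)
    return Y, T, G, delta

Y, T, G, delta = simulate_dgm(prop_cheaters=0.35, high_effective=True, seed=1)
print(Y.shape, T.sum(), G.sum())
```

Repeating the call with 100 different seeds would give the 100 replicates of one scenario.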

Estimation of the DGM
The parameters of the DGM were estimated using Markov chain Monte Carlo (MCMC) methods [35,36] through the Gibbs algorithm [37]. The following prior distributions were considered:
These variables are i.i.d. for $m = 1, \dots, 200$ and $i = 1, \dots, 100$. Since the distributions of $\theta_{tm}$ and $\theta_{cm}$ do not depend on the student considered, for simplicity we write $\theta_{tm} = \theta_t$ and $\theta_{cm} = \theta_c$. The WinBUGS commands for the DGM are available in Additional file 1.
For each DGM, a chain of 110,000 iterations was drawn from the posterior distribution, including a burn-in period of 10,000 iterations to ensure the convergence of the Markov chains in the sampling process. After the burn-in, only every 100th iteration was stored, in order to obtain a sample of 1,000 approximately uncorrelated observations.
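The arithmetic behind the retained sample size can be checked with a short sketch: discarding 10,000 burn-in iterations from 110,000 leaves 100,000, and keeping one iteration in every 100 leaves 1,000. The zero-based indexing convention and the function name are assumptions made for the illustration; WinBUGS handles this internally.

```python
def thinned_indices(n_iter=110_000, burn_in=10_000, thin=100):
    """Zero-based indices of the iterations kept after burn-in and thinning."""
    return range(burn_in, n_iter, thin)

kept = thinned_indices()
print(len(kept), kept[0], kept[-1])
```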

Estimation of the true prevalence
In real data, we do not know whether a student is a cheater or not. When we apply the DGM, it tells us which students were classified by the model as cheaters (positive test). The percentage of those students is referred to as the apparent prevalence (AP) and is obtained by

$$AP = \frac{\text{number of students classified as cheaters}}{\text{total number of students}}.$$

We want to know the percentage of students who are truly cheaters; the true prevalence (TP) [38] is

$$TP = \frac{\text{number of students who are truly cheaters}}{\text{total number of students}}.$$

A Bayesian approach can be used to estimate the TP [39] using the following relationship with the AP, which takes into account the sensitivity (SEN) and specificity (SPE) of the DGM:

$$AP = SEN \times TP + (1 - SPE) \times (1 - TP).$$

The SEN is the percentage of cheaters who were correctly classified as cheaters, and the SPE is the percentage of non-cheaters who were correctly classified as non-cheaters [40].
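To make the relationship between the AP and the TP concrete, the sketch below computes the SEN and SPE from known (simulated) cheating status and inverts $AP = SEN \times TP + (1 - SPE)(1 - TP)$ to obtain $TP = (AP + SPE - 1) / (SEN + SPE - 1)$, the classical Rogan-Gladen point estimator. This is a simplification: the study used a Bayesian approach with prior distributions on the SEN and SPE, which this sketch does not reproduce. Function names and the numeric inputs in the example are hypothetical.

```python
def sen_spe(true_status, classified):
    """Sensitivity and specificity from paired 0/1 truth and classification lists."""
    pairs = list(zip(true_status, classified))
    true_pos = sum(1 for t, c in pairs if t == 1 and c == 1)
    true_neg = sum(1 for t, c in pairs if t == 0 and c == 0)
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = len(pairs) - n_pos
    return true_pos / n_pos, true_neg / n_neg

def apparent_prevalence(true_prev, sen, spe):
    """Expected proportion classified as cheaters given the true prevalence."""
    return sen * true_prev + (1.0 - spe) * (1.0 - true_prev)

def true_prevalence(app_prev, sen, spe):
    """Rogan-Gladen point estimate of the true prevalence, truncated to [0, 1]."""
    denom = sen + spe - 1.0
    if denom <= 0:
        raise ValueError("SEN + SPE must exceed 1 for the estimator to be valid")
    return min(1.0, max(0.0, (app_prev + spe - 1.0) / denom))

# Hypothetical values: 3% true cheaters, SEN = 0.80, SPE = 0.95.
ap = apparent_prevalence(0.03, 0.80, 0.95)
print(round(ap, 4), round(true_prevalence(ap, 0.80, 0.95), 4))
```

The example shows why the AP alone can mislead: with imperfect specificity, false positives among the many non-cheaters can outweigh the cheaters who are missed.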
To obtain the TP, we used the means and SDs of the SEN and the SPE computed in the simulation study. For the DGM classification, the bounds of the uniform distributions for the SEN and the SPE were fixed at the minimum and the maximum of the means over all scenarios in the simulation study.
The R software [41] was used for statistical analysis and for programming.
Parameters were estimated by the Gibbs algorithm implemented in WinBUGS, called through the R2WinBUGS package [42]. We also used the rpart package [43] for the classification trees, the ltm package [44] to determine which model best fit the data, and the coda [45] and mcmcplots [46] packages to display the posterior distributions and to study convergence.

Results
Simulation study
The SEN and SPE for the cut-off point of 90% were obtained from the 100 replicates of the simulations for the different scenarios shown in Table 1. The SPE was higher than 90%, while the SEN ranged from 60.3 to 90.7%.
In all scenarios, the AP differed from the TP (Table 1).
The simulation study showed that for high prevalence, the cut-off value should be decreased, and for low prevalence, the cut-off value should be increased. Figure 1 presents the estimated gain for each one of the scenarios. We can observe that a cheating student obtains a much higher effective score gain than a non-cheating student. For the non-cheating student, the score gain is very close to 0. If we analyze Fig. 1a and b we can observe that for the same proportion of item pre-knowledge (35%), students obtain a higher effective score gain when it is high-effective; the same happens for the proportion of item pre-knowledge (70%) (Fig. 1c and d).

Application to real data
Data description
A total of 1008 students completed the examination between 2008 and 2011, of whom 774 (76.8%), 218 (21.6%), 14 (1.4%) and 2 (0.2%) completed the examination 1, 2, 3 and 4 times, respectively. Table 2 shows the number of reused items, the number of item reuses, the students' mean score, the mean item difficulty and discrimination levels, and the respective Cronbach's alpha and McDonald's $\omega_h$ and $\omega_t$ for each examination.
From a total of 800 items, 84 (10.5%) were reused once and 13 (1.62%) twice, and the percentage of repetitions per examination ranged from 4 to 26%. The mean item difficulty index was between 0.57 and 0.66; there were statistically significant differences in the difficulty index by examination (p = 0.0471), and all examinations showed a medium difficulty level. The mean item discrimination index ranged from 0.30 to 0.37, and there were statistically significant differences in the discrimination index by examination (p = 0.008); however, all examinations presented good discrimination. Cronbach's alpha was above 0.8 in all examinations, which showed that all examinations had good reliability. The $\omega_t$ showed high internal consistency and the $\omega_h$ showed moderate general factor saturation for all examinations. The difficulty index increased by 3.5% (p = 0.013) in the first repetition and by 6.9% (p = 0.036) in the second repetition compared to the first use, meaning that with repetition the items became easier for the students (Table 3).

Goodness-of-fit of 1-PL model
In order to assess whether the 1-PL model could be used to fit the data, we compared the 1-PL and 2-PL models to verify which one gave the best fit to the real data. Table 4 presents a summary of the goodness-of-fit indices by year and period. According to the BIC and CHull, the 1-PL model fits the data better in all eight examinations. According to the AIC, the 2-PL model fits better in five of the eight examinations (Table 4).

Item pre-knowledge prevalence
The DGM estimated that the AP ranged from 0.00 to 3.30%, and the TP, after using the SEN and SPE information from the simulation study, was between 1.20 and 3.70% for all examinations (Table 5). This situation occurs in all the examinations studied and can be seen in Fig. 2: for students classified as non-cheaters, the percentage of correct answers is practically the same whether or not the items are exposed; the same cannot be said for students classified as cheaters. For the latter, the percentage of correct answers on the exposed items is much higher than the percentage of correct answers on the unexposed items. This was expected, considering that the DGM more easily detects students with item pre-knowledge who have low ability. Those students will have a high gain in the number of correct answers, whereas for students with high ability the gain would be smaller and consequently more difficult to detect. Additionally, low-ability students will be more effective than high-ability students on the exposed items, since their main focus will usually be memorizing items from past examinations, whereas high-ability students use all types of information and so will not be as effective in memorizing items.
Focusing only on the non-exposed items, there is a considerable difference between the two groups of students revealing differences in their true skills that can also be explained by the arguments referred to above.

Discussion
In this work, the DGM was applied to six multiple choice examinations of FMUP's Clinical Anatomy course. The true prevalence of item pre-knowledge in the analyzed examinations ranged from 1.2 to 3.7%; that is, in this course, the prevalence of item pre-knowledge is low compared to the self-reported prevalence of copying answers during an examination at least once during medical school, which has ranged from 52% [8] to 67% [9], and to the self-reported prevalence of item pre-knowledge, which was about 25%. Compared to the prevalence obtained using detection statistics for copying answers, however, the prevalence was high; for example, in 11 examinations held by the Royal College of Pediatrics and Child Health, there was a prevalence of 0.1% [3]. In a low-stakes test measuring student proficiency in Grade 4 English, the prevalence of item pre-knowledge was about 9% [11].
The low prevalence in this study may be due, first, to students making little use of previously provided examinations when studying or, second, to students studying from previously provided examinations but also from other sources at the same time, so that there is no large difference between their true ability and their cheating ability because these students have a high true ability. The second hypothesis is supported by the fact that no differences in the difficulty index were detected between exposed and unexposed items within the same examination; however, over the years, significant differences in the difficulty index were detected, and exposed items became increasingly easy.
This is the first study that attempts to measure the TP of item pre-knowledge; other studies have used the AP determined through a diagnostic test, which will differ from the TP. In our case, we showed that the AP would underestimate or overestimate the TP depending on the examination.
The simulation study was required to assess the effectiveness of the DGM under the same conditions as the real data, and the DGM was applied to the real data in order to estimate the TP of cheating students per examination. The simulation study showed the effectiveness of the DGM when the number of items per test is high (100), the proportion of exposed items is low (20%) and the number of students is small (200). The absolute agreement of the DGM under these conditions was more than 76%. In the previous study by Shu et al., the effectiveness of the DGM was studied for an examination with 40 items, a proportion of compromised items greater than or equal to 30% and 15,000 students, and the cut-off point was set at 0.9. Our study showed that, in the case of a high pre-knowledge prevalence, the SEN was lower than the SPE, thus increasing the bias between the AP and the TP. Changing the cut-off value from 90% to lower values would decrease the difference between the SEN and the SPE, thus decreasing the bias between the AP and the TP (data not shown). If the test-maker has a priori information that the pre-knowledge prevalence is high, they should lower the threshold in order to use the AP as an estimate of the TP.
One possible limitation of this study is that the 1-PL model used by the DGM might not fit the real data well, which would diminish its diagnostic capacity; however, the BIC showed that the 1-PL model had a better fit than the 2-PL model.
The analyzed examinations had medium difficulty, good discrimination and good reliability scores using both the CTT and the IRT, showing that the low prevalence of item pre-knowledge did not have a large impact on the quality of the examinations.
Moreover, it is worth mentioning that one limitation of the present work is the small scale of the study. It would be of interest to apply the DGM on a larger scale, with a larger response sample size and with the inclusion of clinical courses in which item reuse is more common. This remains a topic for future research.