Associations between item characteristics and statistical performance for paediatric medical student multiple choice assessments

Background: Multiple choice questions (MCQs) are commonly used in medical student assessments but are often prepared by clinicians without formal education qualifications. This study aimed to inform the question-writing process by investigating the association between MCQ characteristics and commonly used statistical measures of individual item quality for a paediatric medical term. Methods: Item characteristics and statistics for five consecutive annual barrier paediatric medical student assessments (each n=60 items) were examined retrospectively. Items were characterised according to format (single best answer vs. extended matching); stem and option length; vignette presence and whether it was required to answer the question; inclusion of images/tables; clinical skill assessed; paediatric speciality; clinical relevance/applicability; Bloom's taxonomy domain; and item flaws. For each item, we recorded the facility (proportion of students answering correctly) and point biserial (discrimination). Results: Item characteristics significantly positively correlated with facility (p<0.05) were a relevant vignette, diagnosis or application items, longer stem length and higher clinical relevance. Recall items (e.g., epidemiology items) were associated with lower facility. Characteristics significantly correlated with higher discrimination were extended matching question (EMQ) format, longer options, and diagnostic and subspecialty items. Variation in item characteristics did not meaningfully predict variation in facility or point biserial (less than 10% of variation explained). Conclusions: Our research supports the use of longer items, relevant vignettes, clinically relevant content, EMQs and diagnostic items for optimising paediatric MCQ assessment quality. Variation in item characteristics explains only a small amount of the observed variation in statistical measures of MCQ quality, highlighting the importance of clinical expertise in writing high-quality assessments.


Introduction
Multiple choice questions (MCQs) are a common assessment modality in health professional student and vocational training. MCQs have multiple advantages, including ease of preparation, adaptability to a range of content, familiarity and amenability to automated marking 1 . MCQs can evaluate student knowledge, including higher-order thinking 2 . However, poorly written items may favour the 'exam-wise' candidate rather than fulfil the goals of assessment.
The 2018 Consensus framework for good assessment 3 articulates the features of an optimal single assessment, including validity, reproducibility, result equivalence across sites or testing periods, feasibility, educational effect, catalytic effect and stakeholder acceptability. Favourable MCQ attributes include topic salience, option homogeneity and the absence of construction flaws (e.g., negative questions, absolute terms and logical or grammatical cueing) 4,5 . Much has been written about item optimisation 4-7 ; however, existing MCQ construction guidelines are largely derived from consensus rather than quantitative evidence. The process of preparing and reviewing MCQs is labour intensive and often depends on authors with expertise in content rather than medical education, resulting in wide variability in item statistical performance 8 . Psychometric analysis can help to identify flawed items. Commonly used indices 9 include item facility (proportion answering correctly) and discrimination (e.g., the 'point biserial' measure of correlation between item score and total test score). Some studies explore the effect of individual item characteristics, e.g., vignettes 10 and images 11 , on item performance. However, few examine the influence of multiple item characteristics to provide evidence-based recommendations for clinicians preparing MCQs for medical students 12 . This retrospective study addressed three research objectives using a convenience sample of five annual 60-item summative paediatric MCQ assessments undertaken by medical students in the final or penultimate year of an Australian graduate-entry medical program:
1. Identify how item statistical performance (facility and point biserial) varies according to item characteristics.
2. Determine the item characteristics that are most predictive of statistical performance.
3. Evaluate statistical performance indicators across testing periods.

Ethics statement
This study is governed by the umbrella ethics approval of the Education Office of the University of Sydney (HREC 2016/456).

Some of the item characteristics were highly intercorrelated (r≥0.75), indicating that they measured the same item attribute. 'Number of options' and 'non-clinical skill' were excluded due to a positive correlation with item type (i.e., SBA) and an inverse correlation with 'high' clinical relevance, respectively. Outliers for stem and option length were identified using a threshold of 1.5 times the interquartile range above/below the quartiles and recoded to the upper/lower fences 13-15 .
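The fence recoding described above can be illustrated with a short sketch (illustrative Python, not the authors' SPSS workflow; the word counts below are hypothetical):

```python
import numpy as np

def winsorize_iqr(values, k=1.5):
    """Recode values beyond k*IQR fences (Tukey fences) to the fence values."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # np.clip replaces out-of-range values with the nearest fence
    return np.clip(values, lower, upper)

# Hypothetical stem lengths in words; the longest is recoded to the upper fence
stem_lengths = np.array([12, 30, 45, 50, 52, 60, 149])
recoded = winsorize_iqr(stem_lengths)
```

Recoding to the fences (rather than deleting outliers) keeps every item in the analysis while limiting the leverage of extreme lengths on the correlations.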

Item statistical performance indicators
Item facility and point biserial results were obtained from the University's assessment database. Consideration was given to using the Rasch measure of item difficulty, but it was highly intercorrelated with facility (>0.99) and less easily interpretable. As the reliability of the point biserial is questioned at the extremes of facility (i.e., <20% or >80%), for the purpose of analyses we also classified items as 'low' or 'moderate to high' difficulty based on facility >80% or ≤80% respectively, comparable to other publications 9,12 . Most items (64% of items and 68% of item usages) were of moderate to high difficulty. We also classified items into two groups based on the point biserial being ≥0.15, which accounted for 31% of items and 37% of item usages. Other authors 12 have used a higher threshold for their classification, but for our homogeneous medical school cohort the point biserial typically averages 0.10.
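For readers unfamiliar with these indices, both can be computed directly from a binary score matrix. A minimal sketch (illustrative Python; the study obtained these values from the assessment database, and the responses below are hypothetical):

```python
import numpy as np

def item_statistics(scores):
    """Facility and point biserial from a (students x items) 0/1 score matrix.

    Facility is the proportion answering each item correctly. The point
    biserial here is the Pearson correlation between each item score and the
    total test score; some implementations instead correlate with the
    rest-of-test total, removing the item's own contribution.
    """
    scores = np.asarray(scores, dtype=float)
    facility = scores.mean(axis=0)
    totals = scores.sum(axis=1)
    pb = np.array([np.corrcoef(scores[:, i], totals)[0, 1]
                   for i in range(scores.shape[1])])
    return facility, pb

# Hypothetical responses: 4 students x 3 items
fac, pb = item_statistics([[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1]])
```

An item answered correctly mainly by students with high totals yields a high point biserial; one answered equally often by strong and weak students yields a value near zero.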

Statistical methods
Research Objective 1 was evaluated through linear correlations with original values and chi-square analyses using the binary classifications of facility and point biserial outlined above. ANOVA was used to test differences between means.
Research Objectives 2 and 3 were assessed using regression and descriptive analyses respectively. All analyses were conducted using SPSS Version 28 (IBM Corp., released 2021).
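For a 2x2 table, the chi-square statistic used in these analyses has a convenient closed form. A hedged sketch (illustrative Python, not the SPSS procedure; the counts are hypothetical, not the study's data):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]].

    This is the shortcut form of sum((observed - expected)^2 / expected),
    which for 2x2 tables reduces to n*(ad - bc)^2 over the marginal products.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = EMQ vs SBA format,
# columns = point biserial >= 0.15 vs < 0.15
chi2 = chi_square_2x2(20, 15, 34, 104)
```

A statistic exceeding 3.84 (the 0.05 critical value for 1 degree of freedom) would indicate a significant association between format and discrimination.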

Reviewer agreement of item characteristics and flaws
There was unanimous agreement between all three coders for 85% of the initial coding (67% of the subjective characteristics). A one-level difference in Bloom's taxonomy or clinical relevance accounted for 40% of the disagreements. Including the independent coder resulted in 83% initial agreement across four coders (62% for subjective characteristics). The number of item flaws identified was: none (84% of items); one (16%); or two (<1%). Almost all item flaws (93%) were identified by the same coder (one of the two trainees). Figure 1 summarises item characteristics.
Similarly, chi-square analysis did not identify an association between the binary categorisations of item facility and discrimination (χ²=0.049, 1 degree of freedom, p>0.05), highlighting the independence of these quality markers. Levene's test supported homogeneity of variance of the point biserial between low and moderate/high difficulty items (p>0.05).

Research objective 1 -variation in item statistical performance by item characteristics
The correlation between item characteristics and statistical performance was similar for the 173 unique items and the 293 item usages (Table 1). Facility was significantly correlated with a vignette being present and required (r=0.21 for items, r=0.22 for usages), diagnostic items (r=0.18) and not being a recall item (r=-0.18 for items, r=-0.21 for usages). For the 293 usages, the length of the stem (r=0.16), clinically relevant items (r=0.12) and application items (r=0.18) were also significantly positively correlated with facility. However, when the binary classifications of item difficulty level and discrimination were analysed (Table 2), no item characteristics were significantly associated with an item being low difficulty.
Similarly low, but significant, correlations were evident between the point biserial and EMQs (r=0.22 for items, r=0.18 for usages), diagnostic items (r=0.18), subspecialty items (r=0.21 for items, r=0.17 for usages) and not being a general paediatrics item (r=-0.16). A significant negative correlation was found between the point biserial and management items (r=-0.12) for the 293 item usages. The average point biserial for the 173 items was significantly correlated with total option length (r=0.16) and usage frequency (r=0.26), reflecting more frequent use of more discriminating items. EMQ and diagnostic items were again associated with higher discrimination when analysed using the binary classification of point biserial (p<0.05) (Table 2).

Research objective 2 -item characteristics most predictive of statistical performance
Two linear regression analyses were conducted to address this research objective, with facility and point biserial as the dependent variables for the 293 item usages. Only variables identified as having a significant bivariate correlation (Table 1) were entered into each equation. As shown in Table 3, both regressions were significant at p<0.05, but the amount of variance explained by the six item characteristics for facility and the six for discrimination was negligible (7% and 8% respectively). The item characteristics used to predict item facility showed more intercorrelation than those used to predict the point biserial, as reflected in the Variance Inflation Factor (VIF). As none exceeded five, multicollinearity was not sufficient to exclude any variables. No variables in either equation were significant independent predictors, as shown by the t-test results (Table 3).

Table 1. Correlation between item characteristics and item performance.
The association between item characteristics and item performance measures (facility and point biserial) was evaluated through linear correlations for five annual summative paediatric medical student assessments. Due to the re-use of some items over the years, results are expressed for all unique items (n=173) and all item usages (n=293).
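The VIF screen reported here can be reproduced by regressing each predictor on the remaining predictors. A hedged sketch (illustrative Python, not the SPSS output; the predictor matrix is hypothetical, not the study's data):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors for the columns of predictor matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from an ordinary least
    squares regression (with intercept) of column j on all other columns.
    Values above 5 are a common flag for problematic multicollinearity.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out
```

Predictors flagged with VIF above five would be candidates for exclusion, mirroring the threshold applied in the study.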

Research objective 3 -evaluating statistical performance indicators across testing periods
Items used twice were more likely to be difficult (mean facility difference -2.6%) and discriminating (mean point biserial difference +0.04) than those used once (Figure 2). The significant correlation between frequency of use and discrimination (r=0.26, p<0.05) was demonstrated earlier in Table 1. However, there was no association (p>0.05) between usage frequency and whether an item was classified as low difficulty or discriminating on chi-square analysis (χ²=4.57, 3 degrees of freedom and χ²=4.79, 3 degrees of freedom, respectively).

Discussion
We found two item characteristics were significantly associated with facility across five years of data in our paediatric item bank: use of a vignette required to answer the item, and diagnosis items. The presence of a vignette has previously been shown to be associated with item facility 10 , given that it confers 'context-richness' 7 . The greater facility of diagnostic items is not surprising, given that diagnosis is a more commonly practised skill than management in the pre-vocational years. Interestingly, there was also a significant relationship between facility and stem length, clinical relevance and application items for the 293 usages, perhaps because these attributes allow more relevant clinical content to be expressed, although the longest stem was still only 149 words. By contrast, recall items were negatively correlated with item facility, were less likely to have a vignette and were less discriminating. It is worth noting that three of the seven items discarded prior to calculating each student's score assessed recall of an epidemiological statistic. This highlights the need to use recall items judiciously. However, the magnitude of the associations was small and these correlations were not reproduced when the data were grouped into 'low' or 'moderate to high' difficulty items.
With respect to item discrimination, significant positive associations included EMQ format and items addressing diagnosis or a paediatric subspecialty. These results are consistent with existing studies of the matching question format 12 . The greater discriminating power of diagnosis items may be due to their focus on assessing skills required for the first post-graduate year. The greater discriminating power of subspecialty paediatric items over general paediatrics items may indicate that higher-performing students covered more of the content, compared with borderline students covering only what was essential to pass. When confining analysis to the 173 items, the point biserial was significantly correlated with total option length (maximum 132 words) and usage frequency. EMQ and diagnosis items remained significantly associated with the point biserial in chi-square analyses. As expected, re-used items tended to be at least moderately difficult and discriminating, reflecting re-selection of high-performing items.
Although we identified significant correlations between item characteristics and performance, the size of this effect was small: less than 10% of the variation in item statistical performance was explained by the item characteristics examined. This requires further study. First, although only a small proportion of item facility and discrimination was explained by modifiable item characteristics, these are some of the easiest item construction variables to modify. Second, these data confirm the relative importance of pre-assessment influences, such as teaching content and delivery. Moreover, this result is not unexpected from a highly vetted MCQ examination bank, where item construction characteristics should not alter student performance. This study's strengths include being one of few 10-12 to quantitatively examine the influence of item characteristics on statistical measures of item performance in medical education. Hernandez et al. 12 addressed the association between item construction, difficulty and discrimination in 125 first-year pathology MCQs. Our work focuses on final-stage students in paediatrics, where image inclusion is less common and vignettes are more frequent and relevant.
Our study also included summative items only, a greater number of items (173) over a longer period (five years), with multiple item coders. Common findings between our studies are that EMQs are more discriminating than SBAs, and that higher-order Bloom's taxonomic domains (understanding and application items) are not associated with lower item facility and/or discrimination, as one might expect. This supports the approach of incorporating items from all strata of the taxonomy. Unlike Hernandez et al. 12 , our study unexpectedly found that recall items were inversely correlated with item facility. One possible explanation is that the recall items in their study (96% of all items) predominantly examined basic science content, whereas a substantial proportion (45%) of the recall items in our study (12% of all items) were epidemiological.
Limitations of our study include the dataset being derived from a single specialty rotation at an individual institution. Several item characteristic variables were also removed due to low frequency (item writing flaws, presence of images/tables), but the prevalence and relevance of these characteristics may vary in other settings, limiting generalisability. Although we used a protocol to reduce heterogeneity between item reviewers, we still found variation in item coding. This highlights the subjectivity of this research method, demonstrating the importance of item review by multiple individuals. Future studies could repeat this work across multiple sites, in a more diverse range of specialities, and using items that more frequently include images.

Conclusions
We found that the statistical performance of individual MCQs may be improved through the addition of vignettes, diagnosis items, EMQs, clinically relevant content, longer stems and items requiring knowledge application. In our study, paediatric subspecialty content was associated with greater discrimination than general paediatrics content, with recall and management items being negatively associated with facility and discrimination respectively. Item preparation should be informed by these findings. Nonetheless, the magnitude of these associations was small, highlighting the importance of pre-examination pedagogical influences.

Introduction
There have been a number of papers published about MCQ writing over the past two decades, and institutions such as the NBME have extensive internal analyses also, so I disagree that 'existing MCQ guidelines are largely derived from consensus rather than quantitative evidence'.

Methods / results:
The various statistical analyses are quite dense to review, even with experience in psychometrics, and I would advise giving some references on the interpretation of psychometrics, perhaps again within the introduction, but certainly within the discussion as appropriate (De Champlain 2010, Tavakol and Dennick 2011, Bibler Zaidi, Grob et al. 2017).

1. The RPB values in this study seem low, and should be commented upon as a valuable piece of information for quality improvement. It is generally advised that RPB be >0.2 for assessments (Beichner 1994, Engelhardt 2009, Bibler Zaidi, Grob et al. 2017).
2. I suggest adjusting the sentence at the end of Page 4, column 2, paragraph 4 to 'EMQ and diagnostic items were again associated with *higher* discrimination when analysed'.
3. For Research Objective 3, were the facility and discrimination of these items known when they were chosen for reuse? My assumption is that their performance statistics were likely the very reason for their reuse?
4.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.

Have any limitations of the research been acknowledged? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Anatomy & Assessment. While I'm very familiar with the practical use of item statistics (and deliver some training on them), I'm neither a statistician nor a psychometrician, so for original research, aspects such as the linear analyses and multicollinearity are beyond my expertise and should have further specialist peer review.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: psychometric analysis
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Characteristics of the 173 unique items analysed. Five consecutive annual barrier paediatric medical student assessments (each n=60 items) were examined retrospectively and their construction characteristics were coded by three item reviewers. The distribution of item characteristics is summarised for the 173 unique items coded.

Figure 2. Net item facility (a) and net point biserial (b) with repeated item use. All items used during five consecutive summative paediatric medical student assessments were collated. Some items were re-used during this period and the usage frequency was recorded. The association between usage frequency and net item facility (a) and net point biserial (b) is shown.
r: Spearman rank-order correlation; p for a two-tailed test. * significant at p<0.05; ** significant at p<0.01. 'Usage' includes items being used more than once.

Table 2. Association of item characteristics with low difficulty and discriminating items (n=173).
The item facility and point biserial for all items included in five years of summative paediatric medical student assessments were recorded. Items were grouped according to a binary classification of facility (>80% or ≤80%) and point biserial (<0.15 or ≥0.15). The association between item characteristics and the binary classifications of item facility and point biserial was evaluated through chi-square analysis. p is significant at <0.05.

Table 3. Regression prediction of facility and discrimination (n=293 item usages) by item characteristics.
Two linear regression analyses were conducted to determine whether item characteristics were predictive of facility and point biserial for five years of summative paediatric medical student assessments. Only variables identified as having a significant bivariate correlation were entered into each equation. p is significant at <0.05. VIF = Variance Inflation Factor.
(Swanson, Case et al. 2001, Wood and Cole 2001, Haladyna, Downing et al. 2002, NBME 2021).

There are a number of papers exploring the influence of item flaws, which could be referenced and discussed in comparison to these findings, including Haladyna, Costello, Casu, Pais, Rodríguez-Díez, Rush, Tarrant, Ali et al. (Haladyna and Downing 2004, Tarrant, Knierim et al. 2006, Tarrant and Ware 2008, Ali and Ruit 2015, Pais, Silva et al. 2016, Rodríguez-Díez, Alegre et al. 2016, Rush, Rankin et al. 2016, Casu and García-García 2018, Costello, Holland et al. 2018, NBME 2021).

Additionally, there have been papers published comparing 5-option SBAs to EMQs; if wishing to make note of this, the ones I would recommend reading and referring to are those by Coderre and/or Swanson (Coderre, Harasym et al. 2004, Swanson, Holtzman et al. 2006). Of course, SBAs can use different numbers of options (my institution has moved from five to three for some programmes), so long as the cut score is adjusted, or the assessment design modified to account for moving from a 20% to a 33% guess rate (Swanson, Holtzman et al. 2005, Tarrant and Ware 2010, Schneid, Armour et al. 2014).

Sylvain Coderre has also written on testing clinical reasoning in MCQs, as have Heist, Beullens and others, which could be included (Beullens, Van Damme et al. 2002, Coderre, Mandin et al. 2003, Beullens, Struyf et al. 2005, Heist, Gonzalo et al. 2014).

A number of authors other than Holland et al. have explored the influence of images in MCQs, which I think could also be referenced (Vorstenbosch, Klaassen et al. 2013, Sagoo, Vorstenbosch et al. 2021). Notebaert makes reference both to his findings on images in MCQs and to the manner in which he compared items from different cohorts; no need to duplicate this, but you should mention it as an alternative method of analysis and give some brief justification as to why you used the methodology by Hernandez instead (Notebaert 2016).