Scoring Short Answer Questions of Five Borderline Medical Students

This work was carried out in collaboration between all authors. Authors HP and PD compiled the questions and marked the answers. Authors MT and HP administered the tests and authors JB and MT analysed the data. All authors read and approved the final manuscript.

ABSTRACT
Background: The assessment of medical knowledge is integral to becoming a medical practitioner [...] estimates, their approach to answering SAQs was vastly different, altering the interpretation of their overall performance.
Conclusion: The sole use of CTT in the analysis of examination data may result in issues of validity and reliability when measuring clinical competence. The Rasch rating scale measurement framework may be invaluable in informing the analysis of performance in high stakes scenarios to ensure fair decisions of clinical competence.

Keywords: Assessment; Rasch measurement; short answer question; classical test theory; medical education.

BACKGROUND
A standard method used to obtain an overall score on a test or examination is simply to add the scores on the questions. This is based on a framework known as Classical Test Theory (CTT) [1]. Such methods, however, disregard the subjective nature of the data by making unwarranted assumptions [2]. One assumption is that the data are on an interval scale (i.e. that the relative value of each response category is the same across questions and that the unit increases across the rating scale are equal in value). In other words, it is assumed that each question contributes just as much to the total score as any other question and that all questions are equally difficult [3]. It is further assumed in CTT that the scores within a question are equally spaced, i.e. that an increase from a score of '1' to '2' represents the same difference as an increase from a score of '2' to '3'.
Rasch measurement theory provides an alternative approach to the analysis of data [4,5]. The family of Rasch models is based on the idea that data must conform to some reasonable hierarchy of 'less than/more than' on a single continuum of interest [6]. The Rasch model uses the traditional total score as a starting point for estimating probabilities of responding. The model is based on the simple idea that all persons are more likely to answer easy items correctly than difficult items, and all items are more likely to be passed by persons of high ability than by those of low ability [7].
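The idea that easy items are more likely to be answered correctly than difficult ones can be illustrated with the dichotomous Rasch model. The following is a minimal sketch only; the function name and the ability and difficulty values are hypothetical and are not taken from the study data.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; the probability depends only on their difference.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values: one person of average ability, one easy and one hard item.
p_easy = rasch_probability(ability=0.0, difficulty=-1.0)
p_hard = rasch_probability(ability=0.0, difficulty=1.0)

print(round(p_easy, 3), round(p_hard, 3))
```

When ability equals difficulty the probability is 0.5, and it rises as ability exceeds difficulty.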
The Rasch model provides estimates for each question (difficulty) and each person (ability) separately, but on the same scale; something that is not possible in Classical Test Theory [3,6]. Equality of intervals is achieved through log transformations of raw data odds, and abstraction is accomplished through probabilistic equations [8]. The person ability and question difficulty estimates, having been subjected to a log transformation, are displayed along a logit (log odds unit) scale which is an interval scale in which the unit intervals have a consistent value or meaning.
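The log transformation described above can be written compactly for the dichotomous case. The symbols are assumptions introduced here for illustration: $P_{ni}$ is the probability that person $n$ answers question $i$ correctly, $\beta_n$ is the person ability and $\delta_i$ is the question difficulty, both in logits.

```latex
\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \beta_n - \delta_i
\qquad\Longleftrightarrow\qquad
P_{ni} = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}}
```

Because the log odds depend only on the difference $\beta_n - \delta_i$, a one-logit difference has the same meaning anywhere along the scale.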
When questions are scored according to a marking guide, an extension of the basic Rasch model is needed to accurately represent such polytomous data [6]. Rating scale analysis allows each question's relative difficulty to be estimated, as well as the pattern of the scale categories in each question to yield a rating scale structure. Thus, each question has a difficulty estimate, and the scale itself also has a series of thresholds [9].
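A rating scale extension of this kind can be sketched as follows. This is an illustrative implementation of an Andrich-style rating scale model, not the software or parameterisation used in the study; the function name and the threshold values are hypothetical.

```python
import math

def rating_scale_probabilities(ability, difficulty, thresholds):
    """Category probabilities for scores 0..m under an Andrich-style rating scale model.

    `thresholds` holds the m threshold parameters (in logits) shared by the scale.
    """
    # Cumulative sums of (ability - difficulty - threshold_k);
    # a score of 0 corresponds to the empty sum, i.e. 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (ability - difficulty - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category question (scores 0 to 3) with ordered thresholds.
probs = rating_scale_probabilities(ability=0.0, difficulty=0.0,
                                   thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Each question contributes one difficulty parameter, and the thresholds describe how hard it is to move from one score category to the next.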
As stated above, one of the main advantages of Rasch measurement theory over classical (traditional) theory is that item difficulty estimates and person ability estimates can be located on a common interval level scale [3]. This can also be done for rating scales, where the difficulty to 'achieve' each category can be shown on a scale [10].

METHODS
The study was designed to compare the performance of Australian medical students across two scoring regimes, namely Classical Test Theory (CTT) and the Rasch rating scale measurement framework. A unique username and password were generated for each participant in order to access the practice examination. These were only provided to participants on the day of their examination, upon sign-in. The examination consisted of 40 Short Answer Questions (SAQs), of which 20 were marked on a four-point scale (0 to 3), ten on a three-point scale (0 to 2) and ten on a two-point scale (0 or 1), to yield a maximum possible score of 90. The questions were administered through Google Forms, an application on the Google cloud platform, and covered eight disciplines: medicine (med), surgery (surg), psychiatry (psych), orthopaedics (ortho), general practice (GP), obstetrics / gynaecology (O&G), paediatrics (paed) and anaesthetics / pain medicine / intensive care (APIC).
Students were allowed 90 minutes to complete the test and all responses were captured online and collated for scoring and analysis. The responses were distributed to three independent markers, who marked them according to a marking guide.
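As a quick check of the mark scheme described above, the number of questions and the maximum possible score can be reproduced from the three question blocks. The variable names are illustrative only.

```python
# (number of questions, maximum mark per question), as described in the text
question_blocks = [(20, 3), (10, 2), (10, 1)]

n_questions = sum(count for count, _ in question_blocks)
max_score = sum(count * max_mark for count, max_mark in question_blocks)

print(n_questions, max_score)
```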

Classical Test Theory
The average of the three markers' scores was calculated for each question and rounded to the nearest whole number. Each student's performance was expressed as the sum of the question scores. Table 1 summarises descriptive statistics of students' total scores. The results show a minimum score of 23.3 per cent, a maximum score of 76.7 per cent and a mean of 52.2 per cent. If 45 out of 90 is considered as the passing score (i.e. 50%), then there were five students exactly at the passing score. These five students' scoring patterns are considered in more detail below. The Cronbach alpha reliability of 0.77 yields a standard error of measurement of 4.7 which can be used to determine a precision band around the passing score, and this average standard error is applied in the same way for all students' scores.
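The relationship between reliability and the standard error of measurement can be sketched with the standard CTT formula, SEM = SD x sqrt(1 - reliability). The standard deviation of the total scores is not quoted in the text, so the sketch back-calculates the value implied by the reported figures. It assumes the SEM of 4.7 is expressed in raw marks out of 90, and the band of plus or minus one SEM is an illustrative choice.

```python
import math

def standard_error_of_measurement(sd_total: float, reliability: float) -> float:
    """Classical SEM: SD of total scores times sqrt(1 - reliability)."""
    return sd_total * math.sqrt(1.0 - reliability)

def implied_sd(sem: float, reliability: float) -> float:
    """SD of total scores implied by a reported SEM and reliability."""
    return sem / math.sqrt(1.0 - reliability)

alpha = 0.77      # Cronbach alpha reported in the text
sem = 4.7         # SEM reported in the text (assumed to be in raw marks)
pass_score = 45   # 50% of the maximum score of 90

sd_total = implied_sd(sem, alpha)
band = (pass_score - sem, pass_score + sem)  # illustrative +/- 1 SEM band

print(round(sd_total, 2), band)
```

Under CTT the same band width is applied to every student, whatever their score.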

Rasch Rating Scale
In the Rasch rating scale analysis, the first step was to determine whether the questions were well targeted to the students. Fig. 1 shows a good match between the student ability measures (red bars) and the 40 question difficulties (blue bars).
The mapping confirms good targeting with the questions providing maximum information around the peak of the student ability estimates. Since each question also had a number of score categories (two to four), more detailed information about the targeting was obtained by plotting the individual category difficulties of each question against the student ability estimates; see Fig. 2.

Fig. 2. Person-item threshold distribution
Good targeting was confirmed, although it is noted that three categories (two in particular) were very difficult to achieve, i.e. students were unlikely to obtain those scores.
The power of the test-of-fit was investigated next, and the Person Separation Index (PSI) of 0.77 indicated that 77 per cent of the variance in the observed scores was due to the estimated true variance in students' levels of clinical competence and that the error variance, which includes marker severity, was 23 per cent. The mean student ability of 0.046 logits (standard deviation of 0.449) matched the mean question difficulty of 0.000 logits (standard deviation of 1.210) very well. The student mean fit residual of 0.019 (SD of 0.853) and the question mean fit residual of 0.290 (SD of 0.792) showed slightly more misfit in the question estimates, and the chi-square probability value of the question-trait interaction indicated that Rasch analyses could be done.
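The variance split implied by the PSI can be sketched numerically. This assumes the PSI is defined as (observed variance minus error variance) divided by observed variance, and that the reported standard deviation of 0.449 refers to the estimated student abilities; neither assumption is stated explicitly in the text.

```python
import math

psi = 0.77          # Person Separation Index reported in the text
sd_ability = 0.449  # SD of student ability estimates (logits), from the text

observed_variance = sd_ability ** 2
error_variance = (1.0 - psi) * observed_variance   # 23 per cent of observed variance
true_variance = psi * observed_variance            # 77 per cent of observed variance
average_standard_error = math.sqrt(error_variance)

print(round(observed_variance, 4), round(error_variance, 4),
      round(average_standard_error, 3))
```

Under these assumptions the average standard error of a student ability estimate is roughly 0.2 logits.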
The analysis confirmed significant differences between the overall question difficulties, ranging from -3.251 logits (question 10) to 3.381 logits (question 35). The most difficult score to obtain was a 3 in question 35 (category difficulty of 5.762 logits), followed by a 3 in question 14 (category difficulty of 5.197 logits), whilst the easiest was a 1 in question 27 (category difficulty of -2.036 logits).
The step structures of the questions were used to explore the scoring structures in more detail. Question 2, for example, suggested that scoring in all categories was not always probable (see Fig. 3). The scoring structure showed that it was never most likely to score 1 (red line) in this question on a scale of 0 to 3. Students with lower ability estimates most likely scored zero (blue line), after which a score of 2 (green line) and then 3 (purple line) became more likely as ability estimates increased. In contrast, some questions such as question 4 were much better ordered, as can be seen in Fig. 4.

Results of Borderline Students
According to CTT, five students (N1 to N5) were identified as being exactly at the pass mark of 50 per cent, with a score of 45 out of 90. The distribution of their marks over the eight disciplines can be seen in more detail in Table 2.
In the Rasch analysis, a score of 45 equates to an ability estimate of -0.047 logits with a standard error of measurement (SEM) of 0.207. As seen in Table 2, although the students had the same total score of 45 out of 90, their performance was very different when discipline scores were examined in detail. Student 1 (N1) had a score of 11 per cent in General Practice (GP), while the four other borderline students had a score of at least 56 per cent. If GP were considered a fundamental area of knowledge in determining the clinical competence of a student, it could be argued that this student should have failed. If 50% were the cut-score for each of the eight disciplines, student 1 would have passed six disciplines, student 2 four and the other students five each. If a pass in at least six disciplines were added as a criterion to pass, only student 1 would have passed.
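The precision of the borderline ability estimate can be illustrated with a simple band around -0.047 logits. The multiplier of 1.96 (an approximate 95 per cent band under a normal approximation) is an assumption made for this sketch and is not specified in the text.

```python
cut_logit = -0.047    # ability estimate equivalent to 45 out of 90, from the text
sem_logit = 0.207     # standard error at that score, from the text
z = 1.96              # assumption: approximate 95% band

lower = cut_logit - z * sem_logit
upper = cut_logit + z * sem_logit

mean_ability = 0.046  # mean student ability reported in the text
mean_inside_band = lower <= mean_ability <= upper

print(round(lower, 3), round(upper, 3), mean_inside_band)
```

Under this assumption the band spans roughly 0.8 logits, which is wide relative to the reported spread of student abilities (standard deviation of 0.449).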
Although the conclusions above provide additional information about the performance of the students, the assumption is that the difficulty over the disciplines is constant, i.e. that 50% in GP is the same as 50% in Med. Furthermore, it is noted that the number of questions over disciplines is not the same. Requiring 50% to pass in GP is thus different to requiring 50% to pass in Ortho.
When the data fits the Rasch model, ordinal raw scores are converted into a metric linear interval scale using the unit of logits. Such measures can subsequently be used to compare performances directly because all measures are expressed on the same scale. Calibration of all questions in the test yielded a mean question difficulty of 0.00 logits (by definition) with a standard deviation of 1.210 (as mentioned above). The mean student ability was 0.046 logits with a standard deviation of 0.449. From a score equivalence table it was derived that 50% overall equates to an ability estimate of -0.047 logits.
The exam was subdivided into the eight disciplines and the mean difficulty of each discipline was calculated. These are summarised in the table below. It is noted that there are quite significant differences in the mean discipline difficulties. Surgery was the "most difficult" and Medicine the "easiest"; a difference of 0.88 logits, which clearly indicates that a score of 50% in one discipline is not the same as a score of 50% in another. It is noted that the biggest difference is between students 2 and 4 and therefore these two students will be considered in more detail. In the following table the performance of student 2 is compared with the performance of student 4 in logits for each discipline.
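The 0.88-logit gap between the most and least difficult disciplines can be put in more familiar terms. The sketch below uses the dichotomous Rasch model as a simplification (the study used a rating scale model) and considers a hypothetical student whose ability equals the mean difficulty of the easier discipline.

```python
import math

gap_logits = 0.88  # difference between Surgery and Medicine mean difficulty, from the text

# Odds of success on a question of average difficulty in the easier discipline,
# relative to one in the harder discipline, for the same student.
odds_ratio = math.exp(gap_logits)

# Hypothetical student with a 50% chance on an average question in the easier
# discipline: chance on an average question in the harder discipline.
p_easier = 0.5
p_harder = 1.0 / (1.0 + math.exp(gap_logits))

print(round(odds_ratio, 2), p_easier, round(p_harder, 3))
```

Under this simplification the same student's chance of success falls from 50 per cent to just under 30 per cent, which is why equal percentage cut-scores across disciplines do not represent equal standards.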

DISCUSSION
Passing or failing borderline candidates has been, and perhaps will always be, a contentious issue. This study demonstrates that simply adding scores on individual questions to obtain an overall score may be misleading, especially if subsets of scores are used to make pass/fail decisions. Although sum scores give some indication of relative performance, subset scores are not on a common scale and should therefore not be compared directly.
Scores on different scales are likely to lead to biased interpretation, whereas the location of measures on a single scale, as is common practice in Rasch measurement, overcomes such potential bias. Through exploring the scoring structures of the questions and using Rasch calibration to construct a single scale, valid comparisons can be made. The Rasch rating scale measurement framework is beneficial in analysing data in high stakes examination situations because it provides more detailed information on individual performance.
Corroborating the findings of Tor and Steketee [4], the use of Rasch modelling in assessing clinical competence in medical students can provide much needed quality assurance in high stakes examinations.
The certification process in medical education often requires candidates to pass multiple forms of assessment. Through the application of Rasch measurement theory in the psychometric analysis of these, it is possible to create one common scale to locate performance across all assessments. This may allow regulating bodies and verification authorities to maintain that the same standard is required to pass any form of a certification exam, at any point in time.
Although Rasch analysis is not sample-dependent, it is noted that the sample in this study was from one Australian university. It is not envisaged, however, that other samples in the target population of medical graduates would differ greatly from the current sample. In addition, the examination was devised as a practice formal examination only, with question content developed by content experts. In this case, the range of 2.01 in the fit residual statistics may be accounted for by the fact that students knew it was not a critical examination.

CONCLUSION
A comparison of CTT and Rasch analysis on the SAQ data of borderline medical students showed that Rasch analysis provides long term advantages in assessment in medical education by providing critical information on individual score patterns and on the assessment of clinical competence through SAQs. Future research could utilise the Rasch model to examine differences in individual ability estimates when individuals are given the choice to select which SAQs / Cases they respond to.