Using Student Ability and Item Difficulty for Making Defensible Pass/Fail Decisions for Borderline Grades

Making Pass/Fail decisions over Borderline grades (i.e., grades which do not clearly distinguish between competent and incompetent examinees) has been an ongoing challenge for academic institutions. This study utilises the Objective Borderline Method (OBM) to determine examinee ability and item difficulty, and from these reclassifies each Borderline grade as a Pass or Fail. Using this approach, examinees' Borderline grades from a clinical examination were reclassified as Pass or Fail. The predictive validity of the method was estimated by comparing the examination's original and reclassified grades with each other and with subsequent clinical examination results. The new model was more stringent (p < .0001) than the original decisions. Implications for educators and policy makers are discussed. The modified method (OBM2) is found to provide a plausible solution for decision making over Borderline grades in non-compensatory assessment systems.


Modern Test Theory and Decision Making
Modern test theories such as item response theory (IRT) have become more commonly used in medical education (Boulet et al., 2003; Downing, 2003). IRT methods have mostly been used for improving the quality of written test items rather than for determining Pass/Fail cutoff scores for clinical examinations (Downing, 2003; Schuwirth & Vleuten, 2010), or for helping to calibrate test items before applying other commonly used standard setting methods (Boulet et al., 2003; Ferdous & Plake, 2008; Grosse & Wright, 1986; MacCann & Stanley, 2006; Wang, Wiser, & Newman, 2001). The most advanced standard-setting method that uses an IRT framework for determining a Pass/Fail cutoff score is the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2002; Karantonis & Sireci, 2006; Lewis, Mitzel, & Green, 1996; Peterson, Schulz, & Engelhard Jr., 2011). Nonetheless, the Bookmark method has been criticized mainly for being resource intensive and for using an arbitrary value (.67 probability of success) to establish the point that is used to rank order items for the judges' booklets, which may explain why it has not been widely used for setting cutoff scores in clinical examinations (Karantonis & Sireci, 2006; Lewis et al., 1996). It is also noted that although findings suggest that the Bookmark method is preferable to the Angoff method (Peterson et al., 2011), concerns remain about judges' ability to make reliable decisions despite that additional information (Davis-Becker, Buckendahl, & Gerrow, 2011; Deunk, Van Kuijk, & Bosker, 2014).

The Objective Borderline Method: An Alternative Method for Decision Making over Borderline Grades
In response to the abovementioned challenges, Shulruf et al. (2013) recently introduced the Objective Borderline Method (OBM), which is based upon a measure of difficulty of the examination in question, formed from the initial results that the students actually obtained on this examination. There are two separate but related and fairly natural measures of difficulty available (Raykov & Marcoulides, 2011; Sax & Reade, 1964). One seeks to combine these two measures into a single measure. There are innumerable ways in which this might be done. A simple, plausible and perhaps intuitive way is to think of the two initial measures (which are both numbers between 0 and 1, generated from observed proportions of those two categories) as being notionally the probabilities of success on two independent tests or experiments. The combined measure is formed simply as the product of these two probabilities and may thus be conceptualized as the probability of success on both tests. It must be emphasized here that these two independent tests are conceptual only. However, they serve as a useful heuristic guide to our thinking in constructing the combined measure. This combined measure is just an index (similar to other indices, e.g. BMI) whose validity is determined only by its usefulness. This combined probability or index is by no means the probability of the occurrence of any actual event (Shulruf et al., 2013).
Explicitly, the initial results consist of a number of Fail, Borderline, and Pass grades. The first notional test consists of drawing a grade at random from the collection of all Fail and Borderline grades. "Success" is considered to be drawing a Borderline grade, and the first measure of difficulty is the probability of drawing a Borderline grade (i.e. the observed proportion of Borderlines among the pool of Borderlines and Fails). The second notional test consists of drawing a grade at random from the collection of all Borderline and Pass grades. Here "success" is considered to be drawing a Pass grade (i.e. the observed proportion of Passes among the pool of Passes and Borderlines). Each of these two tests (achieving Borderline rather than Fail, and achieving Pass rather than Borderline) is a common measure of difficulty when a test includes two categories (Raykov & Marcoulides, 2011; Sax & Reade, 1964). If the numbers of Fail, Borderline and Pass grades are $n_F$, $n_B$ and $n_P$ respectively, then the probability of success on the first notional test is $P_{r1} = n_B / (n_F + n_B)$ and the probability of success on the second notional test is $P_{r2} = n_P / (n_B + n_P)$. The combined measure of difficulty is then $P_r = P_{r1} \times P_{r2} = \frac{n_B}{n_F + n_B} \times \frac{n_P}{n_B + n_P}$. Figure 1 schematically illustrates how the OBM index is calculated.
The OBM utilizes $P_r$ by assigning a conceded Pass to a proportion of the Borderline grades equal to $P_r$, and a conceded Fail to the remaining Borderline grades.
It is acknowledged that, like all standard setting models, the OBM is derived from some arbitrary premises (Cizek, 1993). However, Shulruf et al. (2013) demonstrated that the OBM is at least as effective as other standard setting methods (e.g. the Regression and the Borderline Groups methods).
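As a minimal sketch of the index described above, the following Python function computes $P_r$ from the three grade counts and applies it to a set of hypothetical counts (10 Fail, 30 Borderline, 60 Pass). The function name `obm_index`, the example counts and the rounding of the conceded-Pass count to a whole number are illustrative assumptions, not part of the published method.

```python
def obm_index(n_fail: int, n_borderline: int, n_pass: int) -> float:
    """Combined difficulty index P_r = P_r1 * P_r2 (hypothetical helper).

    P_r1 = n_B / (n_F + n_B): proportion of Borderlines among Fails and Borderlines.
    P_r2 = n_P / (n_B + n_P): proportion of Passes among Borderlines and Passes.
    """
    if min(n_fail, n_borderline, n_pass) < 0:
        raise ValueError("counts must be non-negative")
    if n_borderline == 0:
        # Without Borderline grades there is nothing to reclassify.
        raise ValueError("at least one Borderline grade is required")
    p_r1 = n_borderline / (n_fail + n_borderline)
    p_r2 = n_pass / (n_borderline + n_pass)
    return p_r1 * p_r2


# Hypothetical counts, chosen only for illustration.
n_fail, n_borderline, n_pass = 10, 30, 60
p_r = obm_index(n_fail, n_borderline, n_pass)

# Assumption: the conceded-Pass count is rounded to the nearest whole grade.
conceded_pass = round(p_r * n_borderline)
conceded_fail = n_borderline - conceded_pass

print(f"P_r = {p_r:.3f}; conceded Pass = {conceded_pass}, conceded Fail = {conceded_fail}")
```

With these counts, $P_{r1} = 30/40 = .75$ and $P_{r2} = 60/90 \approx .67$, so about half of the 30 Borderline grades would be conceded Pass.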

The Overarching Objectives of the Current Study
The current study introduces a modification to the OBM model which enables making Pass/Fail decisions for any type of marks (continuous or categorical), as long as the marks can be initially classified into three categories (Pass, Borderline and Fail) and the number of Passes is greater than zero. It offers a practical and theoretically defensible method to determine which of the Borderline grades, within a categorical set of grades, should be considered as Pass and which should be considered as Fail. The improved model is named the Objective Borderline Method 2 (OBM2) as it uses two measures (examinee ability and item difficulty) to determine whether a Borderline grade should be reclassified as Pass or Fail. Unlike the OBM, the OBM2 does not establish a cut-score but determines whether a Borderline grade should be Pass or Fail on a case by case basis. Such a solution may benefit any panel of examiners who need to make Pass/Fail decisions over Borderline grades for non-compensatory assessment systems, where a high score in one domain cannot compensate for a low score in another. A thorough review that took place in the preparation of this study failed to identify any such method.

The Objective Borderline Method 2 (OBM2)
The original OBM estimates the combined probability ($P_r$) of being successful in two notional tests based on the counts of Passes, Borderlines, and Fails ($n_P$, $n_B$ and $n_F$ respectively) for a set of examination scores of a group of examinees. The OBM2 uses the same approach as the OBM but at the item rather than the examination level, assuming all items are unidimensional (Hattie, 1985).
Consequently, when a group of examinees are assessed using a set of unidimensional items and their performance is classified as Pass, Borderline or Fail, it is possible to calculate two different combined probabilities (i.e. $P_r$) for each Borderline grade. The first is based on the particular examinee's grades across all items (referred to as the examinee's $P_r$ and denoted as $P_e$); the second is based on the grades for a particular item across all examinees (referred to as the item's $P_r$ and denoted as $P_i$). Analogous to Item Response Theory (Kolen & Brennan, 2004), $P_e$ is a measure of the examinee's ability and $P_i$ is a measure of item difficulty. The relationship between these two probabilities ($P_e$ and $P_i$) can be used to determine whether the Borderline grade should be conceded Pass or Fail. This relationship is expressed by a decision index ($P_d$), which is the quotient $P_d = P_e / (P_e + P_i)$.
When $P_d > .5$, the examinee's ability is greater than the item difficulty, hence the Borderline grade should be conceded Pass; when $P_d < .5$, it should be conceded Fail. Note that when $P_d = .5$ the Pass/Fail decision cannot be determined by this index. In this case the decision must be determined by a pre-specified policy.
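The decision rule above can be sketched in Python as follows, assuming grades are coded "F", "B" and "P" and arranged as one row per examinee and one column per item. The function names, the toy grade matrix, the `tie_policy` argument (standing in for the pre-specified policy at $P_d = .5$) and the treatment of the undefined case $P_e + P_i = 0$ as a tie are all illustrative assumptions. Exact fractions are used so that the comparison with .5 is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction


def obm_index(grades):
    """P_r for a list of 'F'/'B'/'P' grades containing at least one 'B'."""
    counts = Counter(grades)
    n_f, n_b, n_p = counts["F"], counts["B"], counts["P"]
    if n_b == 0:
        raise ValueError("at least one Borderline grade is required")
    return Fraction(n_b, n_f + n_b) * Fraction(n_p, n_b + n_p)


def decision_index(p_e, p_i):
    """P_d = P_e / (P_e + P_i); None when both indices are zero (undefined)."""
    total = p_e + p_i
    return None if total == 0 else p_e / total


def reclassify(grades, tie_policy="F"):
    """Return a copy of the grade matrix with every 'B' conceded 'P' or 'F'."""
    result = [row[:] for row in grades]
    for r, row in enumerate(grades):
        for c, grade in enumerate(row):
            if grade != "B":
                continue
            p_e = obm_index(row)                        # examinee across items
            p_i = obm_index([g[c] for g in grades])     # item across examinees
            p_d = decision_index(p_e, p_i)
            if p_d is None or p_d == Fraction(1, 2):
                result[r][c] = tie_policy               # assumption: policy decides
            else:
                result[r][c] = "P" if p_d > Fraction(1, 2) else "F"
    return result


# Toy data: three examinees (rows) graded on three items (columns).
grades = [
    ["P", "P", "B"],
    ["F", "B", "F"],
    ["P", "B", "P"],
]
print(reclassify(grades))
```

In this toy matrix the first examinee has $P_e = 2/3$ and the third item has $P_i = 1/4$, giving $P_d = 8/11$ and a conceded Pass, whereas the second examinee has no Pass grades, so $P_e = 0$ and the Borderline grade is conceded Fail.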
The current study aims to estimate the validity of the OBM2 by examining what the consequences would be if Borderline grades of medical students' clinical examination were reclassified as Pass or Fail using the OBM2.

Data
The UNSW Medicine program is a 6-year undergraduate entry program organized into three phases, each comprising two academic years. At the end of each phase, students must pass a clinical skills examination before progressing to the next phase. This study used data from the Phase 1 and Phase 2 clinical examinations from five cohorts of students.
Each of the Phase 1 and Phase 2 clinical skills examinations comprises six standardised stations (for more details on the curriculum and clinical assessments see McNeil, Hughes, Toohey, & Dowton, 2006). The students are assessed on nine criteria encompassing generic communication skills, clinical history skills and physical examination skills. A standard grading sheet is used at each station with additional specific descriptors relevant to the station's tasks. A common 4-point grading system is used for each criterion: Fail, Borderline, Pass and Exceptional. The examiners do not provide a global grade for the station. A Pass/Fail decision for each station is based on the proportion of Fail and Borderline grades: a station is failed as a result of at least two Fail grades, or a combination of one Fail grade and more than two Borderline grades. A Pass/Fail decision for the examination is based on the number of failed stations: students must pass at least three stations. Each grade is also converted to a numerical score (with Borderline representing 50% of the maximum score); a Fail decision is also made if a student's total numerical score is <50%. Students who fail the Phase 1 clinical skills examination are offered a supplementary examination after a period of remediation. Students who fail the supplementary examination are excluded from the program.
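The station-level and examination-level rules described above can be expressed compactly, as in the hedged sketch below. The grade codes ("F", "B", "P", "E"), the function names and the example stations are assumptions made for illustration. Because the text specifies only the numerical value of a Borderline grade, the total score is passed in as an already computed fraction of the maximum.

```python
def station_failed(criteria_grades):
    """A station is failed with >= 2 Fail grades, or 1 Fail plus > 2 Borderline grades."""
    n_fail = criteria_grades.count("F")
    n_borderline = criteria_grades.count("B")
    return n_fail >= 2 or (n_fail == 1 and n_borderline > 2)


def exam_passed(stations, total_score_fraction, min_stations_passed=3):
    """Pass requires at least three passed stations and a total score not below 50%."""
    stations_passed = sum(not station_failed(s) for s in stations)
    return stations_passed >= min_stations_passed and total_score_fraction >= 0.5


# Hypothetical stations, each with nine criteria.
all_pass = ["P"] * 8 + ["E"]
one_fail_two_borderline = ["F", "B", "B"] + ["P"] * 6
one_fail_three_borderline = ["F", "B", "B", "B"] + ["P"] * 5
two_fails = ["F", "F"] + ["P"] * 7

exam = [all_pass, all_pass, one_fail_two_borderline,
        one_fail_three_borderline, two_fails, two_fails]

print(exam_passed(exam, total_score_fraction=0.62))  # three stations passed
print(exam_passed(exam, total_score_fraction=0.45))  # total score below 50%
```

Note that one Fail grade with exactly two Borderline grades does not fail a station under the stated rule, which is why the third hypothetical station counts as passed.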

Sample
Test data were available from 1,136 students who sat the Phase 1 clinical examination. Of these students, 42 did not progress to the Phase 2 clinical examination, and their grades in the Phase 2 clinical examination were considered in our analysis as Fail. This inclusion was based on data not presented in this study, suggesting that the discontinuation of those students was due to unsatisfactory performance in their clinical and non-clinical studies in Phase 1. Thus, this analysis includes all 1,136 students (Y2004, N=210; Y2005, N=229; Y2006, N=226; Y2007, N=238; Y2008, N=233). Demographic data such as gender, age or ethnicity were not included in the dataset or the analysis as they were not deemed relevant to the model discussed.

Analysis
The first analysis employed factor analysis of raw examination scores within each station to ensure unidimensionality of the items (Hattie, 1985). Then, within each station, the decision index ($P_d$) was calculated, leading to the assignment of Pass/Fail to each Borderline grade that was originally given to a student for a performance criterion within a station. Next, based on the OBM2 reclassification of grades, Pass/Fail decisions for the whole clinical examination (all six stations) were calculated in the way described above (students must pass at least three stations and the total score from all stations must exceed 50%).
The last stage compared the predictive validity of the original grades in the Phase 1 clinical examination with that of the reclassified grades derived by the OBM2. The sensitivity, specificity, positive and negative predictive values, and accuracy (overall fraction correct) of the Phase 1 grades for predicting performance in the subsequent Phase 2 clinical examination were measured (see Table 1) (Bossuyt, 2011).
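The predictive-validity measures listed above follow from a 2x2 cross-classification of Phase 1 and Phase 2 outcomes. The sketch below assumes that a "positive" is a Pass in Phase 1 and that the Phase 2 outcome is treated as the reference standard; the function names and the ten paired outcomes are hypothetical and are not the study data.

```python
def confusion_counts(phase1_pass, phase2_pass):
    """Count TP, FP, FN, TN from paired booleans (True = Pass)."""
    tp = fp = fn = tn = 0
    for predicted, actual in zip(phase1_pass, phase2_pass):
        if predicted and actual:
            tp += 1      # passed Phase 1 and passed Phase 2
        elif predicted and not actual:
            fp += 1      # passed Phase 1 but failed Phase 2
        elif not predicted and actual:
            fn += 1      # failed Phase 1 but passed Phase 2
        else:
            tn += 1      # failed both
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Return the ratio, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_summary(tp, fp, fn, tn):
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical outcomes for ten students.
phase1 = [True] * 8 + [False] * 2
phase2 = [True] * 7 + [False, True, False]

summary = diagnostic_summary(*confusion_counts(phase1, phase2))
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under this convention, a policy that aims to minimise False Positives is one that seeks to raise specificity, possibly at the cost of sensitivity.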

Suitability of the Data
The results indicate that the data did not fully meet the criteria for unidimensionality, as in some stations the items loaded on two factors. Nonetheless, Table 2 suggests that within each station there is only one meaningful underlying factor, since none of the factor loadings in any of the six stations met the criteria for a two discrete factor structure (Pett, Lackey, & Sullivan, 2003), and the variance explained by the first factor was between 28 and 35 percent whereas the second factor explained no more than 6%. Thus, it was decided to carry on with the analysis, particularly given the high level of internal consistency within each station (Cronbach's alpha = .80, .80, .77, .82, .79, .82 for Stations 1 to 6 respectively). The average percentage of Borderline grades that were reclassified as Pass (by criterion by station) was 25.8% (range 0.0-58.8%). The comparison of the Pass/Fail decisions of the Phase 1 clinical examination based on the original and the reclassified grades indicates that the OBM2 model was more stringent than the original decision, yet the decisions made by the OBM2 had a high level of agreement with the "original decisions" (decisions made by the board of examination within the institute) (Accuracy = .88) (Table 3). The quality of the OBM2 was estimated by comparing the overall clinical examination grades in the Phase 2 clinical examination with the overall outcomes of the Phase 1 clinical examination as calculated in two ways: by the original method and by the OBM2 model.

Discussion
The main objective of this study was to utilise the recently introduced Objective Borderline Method (OBM) (Shulruf et al., 2013) for supporting Pass/Fail decisions for students who performed at the borderline level in their clinical examination. This was achieved by modifying the OBM to incorporate two measures (examinee ability and item difficulty) for determining whether a Borderline grade should be reclassified as Pass or Fail. In order to provide robust evidence, this study followed the relevant recommendations for research on assessment from the Ottawa 2010 Conference (Schuwirth et al., 2011): (a) basing the research on robust scientific theory (recommendations 7, 8, and 9); (b) taking the modern approach to validity by looking at consequential validity rather than merely comparing one method with another (recommendations 12, 13); (c) adopting the Item Response Theory (IRT) conceptual framework in the development of a new method (recommendation 18). We note that recommendation 18 was only partially followed, as the OBM2 applies only one feature analogous to IRT, namely the comparison of student ability with item difficulty; in no way is it suggested that IRT models were applied in this study or model.
The OBM and OBM2 introduce a new concept in the field of standard setting by "legitimising" the category of a Borderline grade. The underlying assumption is that a Borderline grade is one which indicates that the examinee's assessed performance could not clearly be classified as either Pass or Fail, and that this is a category in its own right (Jalili et al., 2011; Norcini, Shea, & Kanya, 1988). Furthermore, the OBM2 is a plausible solution for making decisions when the data suggest uncertainty (Draper, 2005; Ramsey, 1926). Nonetheless, the OBM2 is not a standard setting method in the sense that it does not set any cut-score but only provides an evidence-based indication of whether a Borderline grade should be conceded Pass or Fail.
The underlying assumption of previous standard setting methods is that there is an inevitable misclassification of examinees' proficiency, where some truly proficient examinees are mistakenly classified as not proficient (False Negative) and others who truly did not reach the appropriate proficiency level are mistakenly classified as proficient (False Positive) (Cizek, 2012; Cizek & Bunch, 2007). The OBM and OBM2 methods address this concept of misclassification by determining the range of Borderline grades as the range where the level of competency could not be classified without any doubt as either clear Pass or clear Fail (Shulruf et al., 2013). This method of classification applies to the determination of the Pass/Fail scores (the definition of in/competency) and the actual classification of an examinee's performance by the examiners (Kane, 1994). Skorupski and Hambleton (2005), for example, demonstrated that the majority of panelists engaged in the item mapping standard-setting method reported having difficulty distinguishing between performance categories.
Thus, enabling examiners or judges to use a "Borderline" category and accepting the uncertainty of such a category might be a more appropriate approach than forcing them to make a decision based on limited information. The actual decision whether a Borderline grade should be reclassified as Pass or Fail would then be determined by all the data points available, which are deemed to be more reliable.
The question of where one should set the cutoff point needs to be decided: whether to employ a stringent policy by granting the final Pass only for a clear Pass (minimizing the number of False Positives), or to take a more lenient policy and grant a final Pass to those who did not clearly fail (minimizing the number of False Negatives). This could be decided either by the agreed opinion of panelists who apply judge-based standard setting, or by policy makers who decide which test-based (panelist-free) standard setting method is to be used (Kane, 2013). Since each test-based standard setting method applies a different mathematical procedure, the results are expected to be somewhat different even if applied to the very same data (Wood et al., 2006). Consequently, no standard setting method, including the OBM/OBM2, could be absolutely objective.
In this study we investigated the impact of the OBM2 with respect to a policy that aims to maximize specificity and minimise the number of False Positives. The results clearly demonstrate that both indices would have been improved (in accordance with our pre-determined policy) had the OBM2 been implemented. Note that progression to Phase 2 was based on the original decisions (decisions made by the Board of Examination) and thus this comparison is somewhat problematic. However, this limitation would apply to any study using real data which resulted in similar decision making. It is noteworthy, however, that had the OBM2 been used, the specificity would have increased from 7% to 34% (Table 4), with a trade-off of a drop in sensitivity from 99% to 89%, which overall, in our view, is a preferable outcome for the chosen policy.
It is evident that the OBM2 model is more stringent than the original decisions made by the Board of Examination for determining Pass/Fail and would increase the number of students failing the clinical skills examination. However, this is expected and perhaps even desirable. Clinical examiners tend to avoid failing students and trainees, particularly as they give borderline students the benefit of the doubt (Cleland et al., 2008; Dudek et al., 2005; Morton et al., 2007; Rees et al., 2009). Such practice is argued to have the potential for major adverse impact on medical practice (Albanese, 1999). Moreover, in their comprehensive review of sources of bias in clinical performance rating, Williams, Klamen, and McGaghie (2003) summarized compelling evidence suggesting that the tendency for leniency is largely embedded in clinical assessment practices, with little to no impact of training on such examiner bias. Consequently, applying the OBM2 model, where a Borderline grade is made a legitimate and well defined category (when neither clear Pass nor clear Fail may confidently be granted), would help examiners avoid the leniency bias and minimize passing incompetent examinees (Kane, 1994).
Currently, most grading criteria describe what a competent examinee should demonstrate but fail to define what constitutes borderline performance. Even when borderline performance is defined, the descriptors are vague, indecisive and poorly correlated with the checklist scores (Pell, Fuller, Homer, & Roberts, 2010). The description of clear Pass and clear Fail criteria aligned to measurable teaching and learning objectives would provide transparent expectations to examinees.
An important contribution of the OBM2 is that it provides Pass/Fail decisions for Borderline grades when the grading system does not use continuous scales but only ordinal categories (e.g. Fail, Borderline, Pass, Excellent). This is an advantage of the OBM2 compared to other standard setting methods, which do not have that capacity, particularly when the use of a continuous scale makes little sense, if any. Moreover, many of the previous standard setting models assume that the categories (for example in OSCE stations) are points on an interval scale (Boursicot et al., 2007; Cizek & Bunch, 2007; Kramer et al., 2003), although that assumption receives little support from the statistical and educational measurement literature, particularly when the number of ordinal categories is fewer than six (Agresti, 2010; Torra, Domingo-Ferrer, Mateo-Sanz, & Ng, 2006).
How acceptable is the OBM2? Although difficult to judge at this stage, there are some indications that it should be easily acceptable. First, it is very easy to use and requires no mathematical/statistical skills (see Appendix 2). Second, the OBM2 does not add any cost to current programs, except the time required for revising the Pass/Fail criteria as described above, which is negligible compared to other methods using panels of experts (Cizek & Bunch, 2007). Third, the analogy between the OBM2 and IRT (both use item difficulty and examinee ability) might be appealing. This analogy is a major advance in the field because by definition a Borderline grade by itself includes very little information, only that it is neither clear Pass nor clear Fail. All other information relevant to the examinee's performance on the "borderline item" is embedded in their performance on all other items encompassed within the same dimension (Cronbach, 1951). Furthermore, the inclusion of item difficulty in the OBM2 method acts as a correction for examiners' leniency/stringency bias. The Pass/Fail decision is determined by the comparison of two calculated probabilities. As clearly observed in [Equation 2], the harder the item, the smaller the $P_i$ and thus the greater the $P_d$ (and vice versa for an easier item). This correction enhances the fairness of the examination, and removes concerns over examiners' bias, which has its most critical impact on borderline performance.
Obviously the OBM2 raises some challenges which need to be addressed. The OBM2 compares two calculated measures conceptualized as probabilities. However, a small number of items may affect its resolution, which, if not fine enough, might impair its effectiveness. In this study there were nine items (criteria) for each station, which yielded 25 unique values of $P_{ri}$, providing sufficient resolution for calculating $P_d$. Six items, however, would yield only 12 unique values of $P_{ri}$. Hence we recommend that the OBM2 should be used only when six or more items are included. Nonetheless, further investigation is needed to determine the impact of the number of items on the OBM2 outcomes.
More limitations are related to the study itself. This study used test data from five past clinical examinations (five cohorts). The grading sheets defined Fail criteria based only on curriculum objectives, which were deemed to suffice for this pilot study. Since no practical decision was made based on the OBM2, this deviation from the suggested practice is minor. It is nevertheless recommended that further studies take a prospective approach to ensure that Pass and Fail criteria are defined as described earlier in this paper. The other minor limitation is that the 42 students who did not progress to Phase 2 were deemed to have failed the clinical examination in that Phase. This decision was made as most of these students did not continue due to poor performance in Phase 1. Since no information on performance in Phase 2 was available, any imputation of data to the clinical examination results of Phase 2 would in any case be based on performance in Phase 1. Therefore, given the low number of failures in the programme, it was believed that including those students in the analysis and assigning them a Fail outcome in the Phase 2 clinical examination would be the most plausible approach.
An important feature of the OBM2 is that it is applicable to non-compensatory assessment systems, where a high score in one domain cannot compensate for a low score in another. No previous study was found in the literature to address decision making over Borderline grades for such assessment systems. It is acknowledged that this is only the first step and that further research may yield better formulae/indices to support decision making over Borderline grades either within or beyond the OBM framework.

Conclusion
Michael Kane's definition of validity provides some important insight into the research on standard setting: "Validity is a property of the interpretations assigned to test scores, and these interpretations are considered valid if they are supported by convincing evidence" (Kane, 2013, p. 56). Like all other methods, the OBM2 has advantages and shortcomings, and they have been discussed above in detail. Whether the evidence provided in this study to support the validity of the OBM2 is sufficiently convincing is left to the readers to judge. Nonetheless, unless empirically proved otherwise, the OBM2 is a plausible method for supporting Pass/Fail decision making for Borderline grades, particularly when a non-compensatory assessment system is applied and the risk of passing incompetent examinees who received Borderline grades is of major concern.

Figure 1. The principle of the OBM: combining indices of difficulty

Table 1. Definition of true positive, true negative, false positive and false negative (adapted from Bossuyt, 2011)

Table 3. Distribution of final Pass/Fail grades by decision model
*Accuracy = overall fraction correct (proportion of agreement out of all grades)

Table 4. Distribution of Phase 2 outcomes by Phase 1 outcome by type of decision

The results indicate that the original decision yielded an accuracy of .93 and a sensitivity of .99 but a specificity of only .07. The OBM2 model was less accurate, but as a more stringent model it yielded a higher level of specificity (.34). A total of 71 (6.6%) students passed the Phase 1 clinical examination based on the original decision but failed in Phase 2. In comparison, the OBM2 model passed only 50 (4.6%) such students. The cost of increasing the specificity was that the OBM2 model resulted in failing 115 (11%) students in the Phase 1 clinical examination who were later successful in the Phase 2 clinical examination.