Possibility of using the yes/no Angoff method as a substitute for the percent Angoff method for estimating the cutoff score of the Korean Medical Licensing Examination: a simulation study

Purpose The percent Angoff (PA) method has been recommended as a reliable method to set the cutoff score instead of a fixed cut point of 60% in the Korean Medical Licensing Examination (KMLE). The yes/no Angoff (YNA) method, which is easy for panelists to judge, can be considered as an alternative because the KMLE has many items to evaluate. This study aimed to compare the cutoff score and the reliability depending on whether the PA or the YNA standard-setting method was used in the KMLE. Methods The materials were the open-access PA data of the KMLE. The PA data were converted to YNA data in 5 categories, in which the probabilities for a “yes” decision by panelists were 50%, 60%, 70%, 80%, and 90%. SPSS was used for descriptive analysis and G-string for generalizability theory. Results The PA method and the YNA method counting 60% as “yes” estimated similar cutoff scores. Those cutoff scores were deemed acceptable based on the results of the Hofstee method. The reliability coefficients estimated by the generalizability test were highest, in descending order, for the PA method and the YNA methods with probabilities of 70%, 80%, 60%, and 50% for deciding “yes.” The panelist’s specialty was the main source of error variance. The error size was similar regardless of the standard-setting method. Conclusion These results showed that the PA method was more reliable than the YNA method in estimating the cutoff score of the KMLE. However, the YNA method with a 60% probability for deciding “yes” can also be used as a substitute for the PA method in estimating the cutoff score of the KMLE.

of the Angoff method. The YNA asks panelists to decide whether the MCE chooses a correct or incorrect answer for each item, instead of the percent correct for each item; thus, it is an easier cognitive task than estimating probabilities [4].
The YNA has consistently produced a higher cutoff score than the Ebel method, but it offers the possibility of faster and easier standard-setting exercises for local, medium-stakes performance exams [5]. However, the YNA method forces the panelists to decide "yes" or "no," which introduces a systematic bias that could produce a substantially distorted cutoff score.
The Korean Medical Licensing Examination (KMLE) has a large volume of items (360 items in 2017 and 320 items in 2022), most of which involve problem-solving, and more than 30% of the items have shown a correct answer rate of 90% or higher. Ahn et al. [6] suggested that the conventional standard-setting process, in which all panelists decide on all items, was not efficient because it requires considerable time and effort from the panel. Furthermore, standard-setting by the Angoff method was reported to be applicable to the mock Korean Nursing Licensing Examination. That mock exam had 295 items, so the item rating was done by 4 groups of 16 raters after dividing all the items into 4 groups [1]. To reduce the panel's burden in deciding on a large volume of items at a high cognitive level for a high-stakes test, standard-setting with subtests made by stratified item sampling has been recommended [7,8]. The YNA method is another way to reduce the panel's burden in the standard-setting process; thus, it is essential to compare the cutoff score and reliability between the PA and YNA methods.
Beyond the cutoff score itself, it is necessary to compare the reliability of various standard-setting methods. The less error, the higher the reliability. To identify the sources of error, an analysis based on generalizability theory is commonly used. Generalizability theory [9] has been widely applied to quantify the relative influence of each factor (e.g., panelists, items, rating rounds) on the variability (reliability coefficient) of cutoff scores, the standard error (SE) of measurement, and panelist agreement [10].

Objectives
This study aimed to compare the results of the PA method and the YNA method for estimating the cutoff score of the KMLE.
Specifically, cutoff scores, reliability coefficients, and the error sources and variances were compared between these 2 methods.

Ethics statement
The panelist rating data in this study were reused from open-access data from Harvard Dataverse. The data were produced as a result of research approved by the institutional review board (IRB approval no., 202001-SB-003) for a study in which Park et al. [7] in 2020 examined the similarity of cutoff scores in KMLE 2017 test sets with different item counts using the modified Angoff, modified Ebel, and Hofstee standard-setting methods. Therefore, neither further IRB approval nor informed consent was required.

Study design
This is a simulation study to compare 2 standard-setting methods using open-access KMLE 2017 panelist data [7].

Setting
In the original research on the open-access data reused in this study, full-day standard-setting workshops were held on February 8 and 22, 2020. Due to the coronavirus disease 2019 pandemic, the workshop was conducted in person on the first day and online on the second day. On the first day, the panelists had an orientation to setting scores using the Angoff method. Table 1 shows the 15 panelists' characteristics. They were recruited by the authors of the previous article [7]. Their specialties were divided into 4 categories. Most of the experts had at least 3 years of experience in item development for licensing examinations.

Variables
Cutoff scores, reliability coefficients, and the error sources and variances were included as test variables. The results of the examinees who took the 2017 KMLE were analyzed (3,105 passed and 158 failed). The mean difficulty was 72.1%, the mean discrimination was 0.17, the reliability was 0.926, and the standard deviation (SD) was 7.33 [10].

Percent Angoff and yes/no Angoff method
Data estimated by all panelists as the percent correct of the MCE according to the PA method were used. In the YNA method, panelists decide whether the MCE would choose a correct or incorrect answer for each item. It is commonly assumed that the probability for deciding "yes" is 50%, but in this study, 5 probabilities for deciding or categorizing "yes" were used: 50%, 60%, 70%, 80%, and 90% (Table 2). Although 60% is the fixed cutoff score of the KMLE, the 70%, 80%, and 90% probabilities were also calculated to identify the overall pattern.
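The conversion described above can be sketched as follows. The ratings and the `pa_to_yna` helper are hypothetical illustrations, not the study's dataset or software, and it is assumed that a PA rating at or above the chosen probability threshold counts as "yes."

```python
# Sketch of converting percent Angoff (PA) ratings to yes/no Angoff (YNA)
# decisions at a chosen probability threshold. Data are hypothetical.

def pa_to_yna(pa_ratings, threshold):
    """Dichotomize PA ratings: 1 ("yes") if the rating >= threshold, else 0."""
    return [1 if r >= threshold else 0 for r in pa_ratings]

# One panelist's PA ratings (estimated percent correct for the MCE) on 6 items
ratings = [45, 60, 72, 55, 90, 68]

for t in (50, 60, 70, 80, 90):
    yna = pa_to_yna(ratings, t)
    # This panelist's YNA cutoff is the proportion of items judged "yes"
    print(f"YNA-{t}%: decisions={yna}, cutoff={100 * sum(yna) / len(yna):.1f}%")
```

As the threshold rises, fewer items are judged "yes," so the panelist's implied cutoff score falls, which is the pattern the 5 categories above are designed to reveal.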

Cutoff score
The cutoff scores were compared between the PA and YNA methods, in which the criteria to decide "yes" or "no" were 50%, 60%, 70%, 80%, and 90%. There were 4 ways to calculate the cutoff score in this study: first, the mean score of the panel's decision (M); second, the mean score minus the standard deviation of the panel's decision (M-SD); third, the mean score minus the standard error of the panel's decision (M-SE); and fourth, the mean score minus the standard error of measurement (SEM) (M-SEM). Every standard-setting method entails classification error, including false-positive and false-negative classifications; thus, it is reasonable to modify the cutoff score to account for this error. One way is to subtract the SE, which focuses on the panel's decision; another is to subtract the SEM, which emphasizes the quality of the test and its reliability [2,4]. The calculation formulas are as follows: M; M − SD; M − SE, where SE = SD/√n (n = number of panelists); and M − SEM, where SEM = SD(test) × √(1 − reliability).
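As a minimal sketch, the 4 calculations can be written out as follows. The panel cutoff values are hypothetical; only the test SD (7.33) and reliability (0.926) are reported statistics from this study, and they reproduce the SEM of 1.99 used in the Results.

```python
import statistics as st

# Hypothetical per-panelist cutoff estimates (the study's real panel data
# are in its open-access dataset, not reproduced here)
panel_cutoffs = [63.0, 61.5, 66.2, 60.8, 64.9, 62.7]

m = st.mean(panel_cutoffs)
sd = st.stdev(panel_cutoffs)            # SD of the panel's decisions
se = sd / len(panel_cutoffs) ** 0.5     # SE of the panel's decisions
sem = 7.33 * (1 - 0.926) ** 0.5         # SEM from the test SD and reliability

print(f"M     = {m:.2f}")
print(f"M-SD  = {m - sd:.2f}")
print(f"M-SE  = {m - se:.2f}")
print(f"M-SEM = {m - sem:.2f}  (SEM = {sem:.2f})")
```

Because SE divides the SD by the square root of the panel size, M-SE always lies between M-SD and M, while M-SEM depends only on the test's own statistics, not on panel agreement.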

Generalizability test
The effect size of the error variance, as well as the generalizability coefficient and the dependability coefficient, can be estimated by the generalizability test [9]. The generalizability theory model used was a random-facet nested design, symbolized as [i × (s:p)], in which item (i) is crossed with panelist (p), and panelist is nested in specialty (s). There were 360 items, 15 panelists, and 4 specialty categories. The data entry is shown in Table 3.
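The variance-component bookkeeping can be illustrated with a short sketch. The component values below are hypothetical, not the study's estimates; the labels follow this article's notation (i, p, s:p, ip), and negative estimates are set to zero, a common convention in generalizability analysis.

```python
# Sketch: converting estimated variance components of the [i x (s:p)] design
# into percentages of total variance. Values are hypothetical.
components = {
    "i": 0.0180,          # item
    "p": 0.0060,          # panelist
    "s:p": 0.0010,        # panelist's specialty
    "ip": 0.0020,         # panelist-item interaction
    "residual": -0.0005,  # unexplained variance (confounded with error)
}

# Negative estimates are clipped to zero before computing percentages
clipped = {k: max(v, 0.0) for k, v in components.items()}
total = sum(clipped.values())
for name, var in clipped.items():
    print(f"{name:9s} {100 * var / total:5.1f}% of total variance")
```

The relative sizes of these percentages are what Table 5 and Fig. 2 compare across the PA and YNA conditions.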

Hofstee method
Ahn et al. [6] in 2018 proposed a 2-step standard-setting process for deciding the final cutoff score: first, set the standard with the modified Angoff or Ebel method; second, check whether the cutoff scores developed in the first step are acceptable with the Hofstee method [6]. Park et al. [7] in 2020 applied the Hofstee method to the 2017 KMLE. In the Hofstee method, panelists answer 4 questions comprising the minimum and maximum acceptable passing scores and failure rates [7]. The acceptable failure rates for the panelists were 10.2% (maximum) and 4.0% (minimum). The acceptable cutoff scores were 69.5% (highest) and 56.25% (lowest). The calculated cutoff score was 61.9% for the 2017 KMLE on a percentile scale. The cutoff scores of this study were checked for acceptability with the Hofstee graph for the KMLE [7].
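A sketch of the Hofstee compromise calculation follows, using the panel bounds reported above (acceptable cutoffs 56.25% to 69.5%, acceptable failure rates 4.0% to 10.2%) but a hypothetical normal score distribution, since the real score data are not reproduced in this paper. The cutoff is taken where the observed cumulative failure rate crosses the line joining (minimum cutoff, maximum failure rate) and (maximum cutoff, minimum failure rate).

```python
import random

# Hypothetical score distribution roughly shaped like the 2017 KMLE
# (mean about 72, SD 7.33); real Hofstee results use the actual scores.
random.seed(1)
scores = [min(100, max(0, random.gauss(72, 7.33))) for _ in range(3263)]

k_min, k_max = 56.25, 69.5   # acceptable cutoff scores (%)
f_min, f_max = 4.0, 10.2     # acceptable failure rates (%)

def fail_rate(cut):
    """Observed percentage of examinees scoring below the cutoff."""
    return 100 * sum(s < cut for s in scores) / len(scores)

def hofstee_line(cut):
    """Failure rate on the panel's compromise line at a given cutoff."""
    return f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)

# Walk up from the minimum cutoff until the observed curve crosses the line
cut = k_min
while cut <= k_max and fail_rate(cut) < hofstee_line(cut):
    cut += 0.05
print(f"Hofstee cutoff ≈ {cut:.1f}%")
```

With the study's panel bounds and a distribution like the one above, the crossing lands between the minimum and maximum acceptable cutoffs, which is how the 61.9% figure for the 2017 KMLE was located on the Hofstee graph.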

Bias
None.

Study size
This study was based on the panelists' opinions, so the study size was not estimated.

Statistical methods
A descriptive analysis was done using IBM SPSS ver. 27.0 (IBM Corp., Armonk, NY, USA). G string V was used for the analysis based on generalizability theory [11].

Comparison of cutoff scores between the PA method and the YNA method
The results of the standard-setting process using the PA and YNA methods are presented in Table 4. The cutoff scores were compared between the PA method and the YNA method, in which the criteria to decide "yes" or "no" were 50% (YNA-50%), 60% (YNA-60%), 70% (YNA-70%), 80% (YNA-80%), and 90% (YNA-90%). There were 4 ways to calculate the cutoff score in this study: the M, the M-SD, the M-SE, and the M-SEM. The SD and reliability of the KMLE test scores were 7.33 and 0.926, respectively, so the calculated SEM was 1.99 [12].
According to the Hofstee results in Fig. 1, the maximum acceptable cutoff score was 69.5%, the minimum acceptable cutoff score was 56.25%, and the final estimated cutoff score was 61.9% for the 2017 KMLE [10]. The cutoff scores from the PA method and from the YNA-60% method with M-SD were acceptable based on the Hofstee results. Although some cutoff scores were above 69.5% (the maximum acceptable cutoff score), the cutoff scores from the YNA-60% method were nearest and most similar to the Hofstee results. In Fig. 1, the lowest cutoff scores and failure rates are marked with A, B, and C. The cutoff scores marked with D, E, F, and G were within the Hofstee score range. The highest cutoff scores and failure rates are marked with I, J, K, L, M, and N. With A, we

Comparison of the size of error source between the PA and YNA methods
The estimated variance component of the [i × (s:p)] model was analyzed using the generalizability test. The results are presented in Table 5 and Fig. 2.
Compared to PA, YNA-50% had a smaller item-related error (i) but a larger panelist-item interaction (ip). Although the effect size of the panelist (p) increased, the variance according to the specialty of the panelist (s:p) was similar. In YNA-50%, the size of the unexplained variance increased, which substantially lowered the reliability. In other words, YNA-50% reduced the differences between items but allowed the panelists to rate each item differently.
Comparing the YNA conditions, YNA-50% and YNA-60% had relatively small item-related variance (i), but the interaction between the panelist and item (ip) was large. When the probability used in the YNA method was 70% or more, the item variance (i) was similar, and there was little interaction between the item and the panelist (ip).
Comparing the variance related to the panelists, the difference between each specialty (s:p) was similar in each method, but the difference between the individual characteristics of the panelist (p) according to the method was substantial.
The reliabilities of YNA-70%, YNA-80%, and YNA-90% were similar; the mean cutoff score of the PA method was 63.5% (Table 4). It can be said that the panelists evaluated items similarly when the probability was 70% or more. The average cutoff score of each specialty group was calculated to analyze why the variance sizes of specialty were similar (Fig. 3). Overall, a similar pattern was found for each specialty regardless of the method, but in YNA-70% to YNA-90%, internal medicine specialists evaluated the correct answer ratio as higher than did the other specialties.

Key results
According to the Hofstee results [7], the cutoff scores of the PA method were acceptable: 63.5% (mean), 58% (M-SD), 62.1% (M-SE), and 61.5% (M-SEM). Among the cutoff scores of the YNA method, the cutoff score (61.4%) calculated as M-SD with YNA-60% was acceptable and similar to the cutoff scores of the PA method. The effect size of the error variance was estimated with generalizability theory. Compared to PA, YNA-50% had a smaller item-related error (i) but a larger panelist-item interaction (ip). Although the effect size of the panelist (p) increased, the variance according to the specialty of the panelist (s:p) was similar. Comparing the variance related to the panelists, the difference between specialties (s:p) was similar across standard-setting methods, but the difference between the individual characteristics of the panelists (p) varied substantially by method. The average cutoff score between specialty groups showed a similar pattern for each specialty regardless of the method; however, in YNA-70% to YNA-90%, internal medicine specialists evaluated the correct answer ratio as higher than the other specialties (Fig. 3). Negative variance components were set to zero when computing the G coefficient and the percentage of variance. The highest reliability coefficients estimated with generalizability theory were found, in descending order, for the PA method and the YNA methods with probabilities of 70%, 80%, 60%, and 50% for deciding "yes."
Fig. 3. The average cutoff scores by specialty category between the percent Angoff (PA) and yes/no Angoff (YNA) methods. M, mean.

Interpretation
When selecting an acceptable cutoff score based on the Hofstee method, the PA method and the YNA-60% method (mean score minus the SD) were most appropriate. The Hofstee method serves as a guideline for selecting an appropriate cutoff score because it shows the maximum and minimum passing scores and failure rates agreed upon by the panelists [6,7]. When the error variances were analyzed with generalizability theory, PA had higher reliability than YNA. This means that PA had higher intra- and inter-panelist consistency than YNA. In the YNA method, the reliability coefficient increased as the probability for deciding "yes" rose from 50% to 70%, but decreased at 80% and 90%. When the panelists set the probability of "yes" at 70%, the panelists' consistency was relatively high. The panelist's specialty was the main cause of error, and the variance attributable to specialty was similar regardless of the standard-setting method.

Comparison with previous studies
This study compared the probabilities of correct answers for a person with minimum competency between the PA and YNA methods. Similar to previous studies [10,11], the reliability coefficients of PA were higher than those of YNA, and the cutoff scores of YNA were higher than those of PA, so the failure rate was also higher. The YNA method is faster and easier for panelists to use, but its reliability is relatively lower, so the YNA is suitable for local, medium-stakes performance exams [5].
According to the results of the generalizability test in this study, the magnitude of the panelists' variance was large, and the panelist's specialty had a great deal of influence on the written test score. Because most previous generalizability analyses involved performance tests rated by a single rater per examinee, such as a standardized patient or a professor, the variance due to the rater's specialty could not be calculated in those studies [10].

Limitations
This study compared the PA and YNA methods. However, the YNA data were hypothetical data calculated from the PA data, so the comparison was between simulated data and real data.

Suggestions
Although the error variance attributable to the panelists was large, few studies have examined the causes of error in panels. The panelist's specialty was identified as a major error source. Because this study found that the panelists' different specialties were a source of systematic error, future work could analyze panelists' rating patterns, implement pre-training to reduce errors between panelists, and adjust scores after determining the cutoff score. The YNA method was suggested as an alternative to reduce the burden on panelists. When there are many items to be decided by the panelists, subtests drawn from all items have also been suggested as an alternative [7,8]. To reduce the burden on panelists and increase reliability when setting standards, various approaches such as subtests and simplified methods can be adopted.

Generalizability
Although it is essential to set a reliable cutoff score in criterion-referenced evaluation, standard-setting is a time-consuming process that is burdensome for panelists. The PA method is popular, but it is not easy to estimate probabilities for a large number of items. For local, medium-stakes tests, it is reasonable to use a 60% probability as the criterion for deciding "yes" or "no" to reduce time and provide an easy process for panelists.

Conclusion
The PA method is more reliable than the YNA method. If the YNA method is used, the 60% criterion for deciding "yes" is recommended because it had a higher reliability coefficient, acceptable scores based on the Hofstee results, and a cutoff score similar to that of the PA method. The panelist's specialty was the main source of error variance, and its size was similar regardless of the standard-setting method.

ORCID
Janghee Park: https://orcid.org/0000-0002-4163-5729

Authors' contributions
All the work was done by Janghee Park.

Conflict of interest
No potential conflict of interest relevant to this article was reported.

Funding
This work was supported by the Soonchunhyang University Research Fund.