Comparison of standard-setting methods for the Korean Radiological Technologist Licensing Examination: Angoff, Ebel, bookmark, and Hofstee.

PURPOSE
This study aimed to compare several standard-setting methods for the Korean Radiological Technologist Licensing Examination, which has a fixed cut score, and to suggest the most appropriate method.


METHODS
Six radiological technology professors set standards for the 250 items of the Korean Radiological Technologist Licensing Examination administered in December 2016, using the Angoff, Ebel, bookmark, and Hofstee methods.


RESULTS
With a maximum percentile score of 100, the cut score for the examination was 71.27 using the Angoff method, 62.2 using the Ebel method, 64.49 using the bookmark method, and 62 using the Hofstee method. Based on the Hofstee method, the acceptable cut score for the examination ranged from 52.83 to 70, but the cut score using the Angoff method was 71.27.


CONCLUSION
These results suggest that the best standard-setting approach for determining the cut score is a panel discussion using the modified Angoff or Ebel method, followed by verification of the rated results using the Hofstee method. Because standard setting has not yet been adopted for the Korean Radiological Technologist Licensing Examination, this study can provide practical guidance for introducing it.


Introduction
A licensing examination evaluates whether a licensee has appropriate skills in the field after earning a license. The criteria for deciding whether an examinee has the appropriate skills and whether to pass an examinee are very important. The criteria for passing the written component of the Korean Radiological Technologist Licensing Examination (KRTLE) are 60% or higher of the total possible score for all subjects and 40% or higher for each subject.
However, these criteria do not consider the quality and level of difficulty of the items or information about the candidates. In this situation, it can be argued that passing the national examination may not guarantee the minimum competence required by the licensing examination.
In response to this, the Korea Health Personnel Licensing Examination Institute conducted a basic research study to examine possible standard-setting methods [1], and in a recent study, the modified Angoff method was found to be appropriate [2]. National examinations may differ depending on the environment. Currently, the KRTLE is administered once a year, but it could be changed to an examination held several times annually using computerized adaptive testing. Therefore, it is necessary to identify various methods that can be applied to changing examination forms.
The purpose of this study was to apply various standard-setting methods for the KRTLE, which has a fixed cut score, and to suggest the most appropriate standard-setting method.

Standard-setting methods
Cizek [3] in 1993 defined standard setting as a legitimate and appropriate rule or procedure that assigns numbers to distinguish differences in performance, and emphasized the procedural definition of the standard-setting process.
Angoff: Angoff [4] estimated the percentage of correct answers for each item of a minimally competent person belonging to a virtual group through a content analysis of the test tool, tallied up the scores for the total items, and calculated the cut score. The Angoff method is the most widely applied method, and it is easy to explain.
Ebel: In the Ebel method, a standard-setting panel first examines each item to determine the level of difficulty (easy, appropriate, and difficult) and relevance (essential, important, acceptable, and questionable) of each item. Then, each item is classified into a 3×4 matrix table according to the level of difficulty and relevance. Next, the panel determines the expected percentage of correct answers to the items in each cell of the matrix table by a person who has minimum competency. Lastly, the number of items in each category is multiplied by the expected percentage of correct answers, and the total results are added to calculate the cut score [5]. The Ebel method involves a more complex standard-setting process than the other standard-setting methods, which are based on an analysis of the content of the test tool, and it therefore imposes a burden on the standard-setting panel [6].
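The Ebel calculation described above can be illustrated with a minimal sketch. The function name `ebel_cut_score`, the item counts, and the expected proportions below are hypothetical values chosen for illustration; they are not data from this study. Only the cells of the 3×4 matrix that contain items are listed.

```python
def ebel_cut_score(item_counts, expected_correct, scale=100.0):
    """Ebel cut score on a 0-to-`scale` scale.

    item_counts[cell]      -> number of items the panel placed in that cell
    expected_correct[cell] -> expected proportion correct (0-1) for a
                              minimally competent person in that cell
    A cell is a (relevance, difficulty) pair.
    """
    total_items = sum(item_counts.values())
    # Multiply the item count in each cell by its expected proportion, then add.
    expected_raw = sum(n * expected_correct[cell] for cell, n in item_counts.items())
    return expected_raw / total_items * scale


# Hypothetical classification of a 20-item test (illustration only).
item_counts = {
    ("essential", "easy"): 10,
    ("essential", "appropriate"): 5,
    ("important", "appropriate"): 3,
    ("acceptable", "difficult"): 2,
}
expected_correct = {
    ("essential", "easy"): 0.9,
    ("essential", "appropriate"): 0.7,
    ("important", "appropriate"): 0.6,
    ("acceptable", "difficult"): 0.4,
}
ebel_cut = ebel_cut_score(item_counts, expected_correct)
print(round(ebel_cut, 2))
```

Dividing by the total number of items simply rescales the expected raw score so that it can be compared with cut scores reported on a 100-point basis.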
Bookmark: The bookmark method was first introduced by Lewis et al. [7] in 1996 as a standard-setting method to calculate the cut score based on the review of a collection of items by standard-setting panelists. This method is called the 'bookmark' method because the standard-setting panelists indicate their judgments about a specially created item collection according to the level of difficulty. The specially created item collection is known as the ordered item booklet (OIB). The basic feature of the bookmark method is that it uses item response theory to construct the OIB. The easiest item is placed at the beginning of the OIB, and the hardest item is placed at the end. The advantage of using a scaling method grounded in item response theory is that the item difficulty and the subject's ability are on the same scale [8,9].
Hofstee: The eclectic method of Hofstee [10] in 1983 was developed to address practical problems arising from disagreement between criterion-referenced and norm-referenced predictions. In the Hofstee method, standard-setting panelists answer 4 questions with assumptions about the subjects who first take the test. Two of the questions are about the appropriate level of knowledge that the subjects should have (indicated as k by Hofstee), and the other two are about the fail rate (indicated as f by Hofstee). The questions are as follows: "First, what is the maximum cut score that would be satisfactory, even if all subjects exceed this score? Second, what is the minimum cut score that would be acceptable, even if all subjects do not reach the score? Third, what is the maximum allowable fail rate? Fourth, what is the acceptable minimum fail rate?" [10].
Selection or allocation of items (subsets): A method in which panelists review all the items and determine the cut score takes a great deal of time and effort because of repeated item review and discussion between panelists. Moreover, many items need to be rated, which can reduce reliability. Ferdous and Plake [11] in 2007 introduced and simulated 2 methods to reduce the number of items that the panelists should rate. The first is to evaluate only some of the items by selecting a subset of items for rating. The second is to divide the total items among the panelists and rate them. When the panelists rated two-thirds of the total items, the results were similar to the results of rating the total items. In addition, the results of rating more than 50% of the items were similar to the results of the overall rating. They suggested that items should be selected or allotted based on their content and difficulty [12,13].

Ethical approval
This study was approved by the Institutional Review Board of Korea University (KU-IRB-18-EX-65-A-1). Informed consent was obtained from participants.

Study design
Descriptive analysis, correlation analysis, and item analysis.

Materials and/or subjects
The radiological technologist licensing examination was the 44th KRTLE, administered on December 18, 2016. Table 1 shows the number of items and the cut score for each subject. On the written test, candidates must score 40% or more of the total possible score for each subject and 60% or more of the total possible score for all subjects. In the practical skill examination, they must score 60% or more of the total possible score.
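The fixed pass rule for the written test can be written as a short check. This is a sketch only: the function name `passes_written` is hypothetical, 1 point per item is assumed, and the item counts per subject are those given for the written subject areas in the Technical information section (Table 1 remains the authoritative source).

```python
def passes_written(subject_scores, subject_max, overall_pct=60, subject_pct=40):
    """Return True if the candidate meets both fixed written-test criteria.

    Integer percentages are used so that the comparisons avoid
    floating-point rounding at the boundary.
    """
    # Criterion 1: at least 40% of the possible score in every subject.
    for name, score in subject_scores.items():
        if 100 * score < subject_pct * subject_max[name]:
            return False
    # Criterion 2: at least 60% of the total possible score.
    total = sum(subject_scores.values())
    total_max = sum(subject_max.values())
    return 100 * total >= overall_pct * total_max


# Assumed maximum scores: one point per item in each written subject.
subject_max = {
    "radiation theory": 90,
    "medical regulations": 20,
    "radiation application": 90,
}
candidate = {
    "radiation theory": 60,
    "medical regulations": 10,
    "radiation application": 55,
}
print(passes_written(candidate, subject_max))
```

Because both thresholds are fixed percentages, the rule does not change with the difficulty of a given examination form, which is the limitation that standard setting is intended to address.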
The panelists selected in the standard-setting workshop included 6 radiologists who were national examiners. The workshop for setting cut scores for the radiological technologist examinations proceeded from 9 AM to 5 PM on Saturday, May 12, 2018 (Table 2). The panelists gave feedback from the survey and consultation meeting. At the end of the standard-setting workshop, panelists were surveyed with the following question: "What do you think is the most appropriate standard-setting method for the national examination? Ebel, Angoff, bookmark, or Hofstee." An advisory council was held to present the workshop outcomes and the suggested methods for standard setting for the national examinations, and a discussion was held about the following topics: (1) Which is the most reasonable method to apply for the national examination? (2) Do all the panelists have to rate all the items? (3) If items to be rated are divided, what is the proper method for doing so?

Technical information
In the Angoff method, the panelists reviewed each item and described the expected percentage of correct answers by a minimally competent person. The sum of the expected correct answers for each item was calculated as the cut score.
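As a minimal sketch of the Angoff calculation, the example below sums each panelist's item ratings and then averages across panelists. Averaging across panelists is an assumption made here for illustration, since the text does not state how individual panelists' sums were combined; the function name `angoff_cut_score` and the ratings are hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings, scale=100.0):
    """Angoff cut score on a 0-to-`scale` scale.

    ratings[p][i] is panelist p's judgment of the probability (0-1) that a
    minimally competent person answers item i correctly.
    """
    n_items = len(ratings[0])
    # Each panelist's raw cut score is the sum of that panelist's item ratings.
    panelist_sums = [sum(panelist) for panelist in ratings]
    # Assumption: the panel's cut score is the mean of the panelists' sums.
    raw_cut = mean(panelist_sums)
    return raw_cut / n_items * scale


# Hypothetical ratings from 3 panelists on 4 items (illustration only).
ratings = [
    [0.8, 0.6, 0.7, 0.5],
    [0.9, 0.5, 0.6, 0.6],
    [0.7, 0.7, 0.8, 0.4],
]
angoff_cut = angoff_cut_score(ratings)
print(round(angoff_cut, 2))
```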
In the Ebel method in this study, item relevance was classified into the following categories: essential (a task the subject should thoroughly know); important (a major and important task); and additional (an additional task). The expected correct answer rate for a minimally competent person was categorized as hard (50% or less), medium (50%-80%), or easy (80% or higher) [14].
To use the bookmark method, an OIB that arranged items in the order of level of difficulty was prepared in advance. The OIB was produced for each subject area. For the item analysis, the level of difficulty and discrimination were calculated using the R program (https://www.r-project.org/) by applying 2-parameter item response theory. The items were arranged based on the subject's ability, θ, with a correct answer rate of 0.67 for each item according to the OIB's production principle. The standard-setting panelists bookmarked the last item that the minimally competent person was expected to answer correctly in each OIB. The competency corresponding to the bookmark point indicated by each panel member was converted into the true score, and the median was determined as the final cut score. The radiologists produced OIBs for 4 subject areas, consisting of 90 items about radiation theory, 20 items about medical regulations, 90 items about radiation application, and 50 items about practical skills.
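The bookmark procedure can be sketched as follows, although the study itself performed the item analysis in R. The sketch assumes a 2-parameter logistic model without the 1.7 scaling constant, and it assumes that the ability at a bookmark equals the θ at which the bookmarked item has a 0.67 probability of a correct answer; conventions for both points vary. All function names and item parameters are hypothetical.

```python
import math
from statistics import median

RP = 0.67  # response probability used to order the items


def p_correct(theta, a, b):
    """2-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def rp_location(a, b, rp=RP):
    """Ability at which an item is answered correctly with probability rp."""
    return b + math.log(rp / (1.0 - rp)) / a


def build_oib(items):
    """Order items from easiest to hardest by their RP67 location."""
    return sorted(items, key=lambda it: rp_location(it["a"], it["b"]))


def true_score(theta, items):
    """Expected number-correct score at ability theta."""
    return sum(p_correct(theta, it["a"], it["b"]) for it in items)


def bookmark_cut_score(items, bookmarks):
    """Median true score over the panelists' bookmarks (1-based OIB positions)."""
    oib = build_oib(items)
    scores = []
    for position in bookmarks:
        item = oib[position - 1]  # last item expected to be answered correctly
        theta = rp_location(item["a"], item["b"])
        scores.append(true_score(theta, oib))
    return median(scores)


# Hypothetical item parameters (a: discrimination, b: difficulty).
items = [
    {"id": "Q1", "a": 1.2, "b": -1.0},
    {"id": "Q2", "a": 0.8, "b": 0.5},
    {"id": "Q3", "a": 1.5, "b": 0.0},
    {"id": "Q4", "a": 1.0, "b": 1.2},
]
oib_order = [it["id"] for it in build_oib(items)]
bookmark_cut = bookmark_cut_score(items, bookmarks=[2, 3, 2])
print(oib_order, round(bookmark_cut, 3))
```

The cut score here is an expected raw score for one OIB; in practice it would be computed per subject area and rescaled before comparison with the other methods.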
To apply the Hofstee method, the maximum cut score and minimum cut scores that would indicate competence and the maximum and minimum fail rates were investigated among the panelists, and the average value was used as the final value. Based on the results of the national exam, the cumulative distribution of the fail rate according to the examination score was derived, and the point of intersection with the final score was determined as the cut score.
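A minimal sketch of the Hofstee intersection is given below. It assumes the usual construction in which a straight line is drawn from (minimum cut score, maximum fail rate) to (maximum cut score, minimum fail rate) and the cut score is taken where this line meets the observed fail-rate curve. The function names, the grid search with a 0.01 step, and the score distribution are hypothetical.

```python
def fail_rate(scores, cut):
    """Proportion of examinees scoring below the candidate cut score."""
    return sum(s < cut for s in scores) / len(scores)


def hofstee_cut_score(scores, k_min, k_max, f_min, f_max, step=0.01):
    """Cut score where the fail-rate curve meets the Hofstee line.

    k_min, k_max: panel-averaged minimum and maximum acceptable cut scores.
    f_min, f_max: panel-averaged minimum and maximum acceptable fail rates (0-1).
    Returns None when the curve does not cross the line inside the range.
    """
    if fail_rate(scores, k_min) > f_max or fail_rate(scores, k_max) < f_min:
        return None
    n_steps = int(round((k_max - k_min) / step))
    for i in range(n_steps + 1):
        cut = k_min + i * step
        # The line falls from f_max at k_min to f_min at k_max.
        line = f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)
        # The fail rate never decreases with the cut score, so the first
        # point at which it reaches the line approximates the intersection.
        if fail_rate(scores, cut) >= line:
            return round(cut, 2)
    return None


# Hypothetical examinee scores and panel judgments (illustration only).
scores = list(range(40, 100))
hofstee_cut = hofstee_cut_score(scores, k_min=50, k_max=70, f_min=0.10, f_max=0.50)
print(hofstee_cut)
```

When no intersection exists, the function returns None rather than forcing a value, which matches the use of the Hofstee range as a reference check rather than an absolute rule.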

Statistics
IBM SPSS ver. 25.0 (IBM Corp., Armonk, NY, USA) was used for the descriptive and correlation analyses, and R ver. 3.4.3 (https://www.r-project.org/) for the item response theory analysis [15]. For the item analysis, the level of difficulty and discrimination were calculated with R by applying 2-parameter item response theory.

Definition of a minimally competent person
A minimally competent person was defined as a person who has only worked for 1 day after obtaining the license, and the content of the items and the expected correct answer rate were determined accordingly.

Comparison of cut scores between cut score setting methods
Table 3 summarizes the results of applying the Angoff, Ebel, bookmark, and Hofstee methods for the KRTLE. Based on a total score of 100, the cut scores assigned by the radiologists were 71.27 using the Angoff method, 62.2 using the Ebel method, 62.49 using the bookmark method, and 62 using the Hofstee method (Appendices 1-4). The cut scores according to the Ebel and bookmark methods were similar, but those according to the Angoff and Hofstee methods were significantly different. For radiologists, the cut score according to the Ebel method was similar to those according to the bookmark and Hofstee methods.

Relationships between standard-setting methods
Table 4 shows the results of confirming the reliability of the rating methods using a correlation analysis by classifying subjects who passed and those who failed according to each cut-off score. The Ebel and Hofstee methods showed similar scores, so the passing and failing rates were similar, too.

The reliability of the rating method
The reliability of the rating method was confirmed using a correlation analysis by classifying subjects who passed and those who failed according to each cut-off score (Table 5). For the radiologists, the correlation between the Ebel method and the Hofstee method was very high (0.983), as was the correlation between the bookmark method and the Hofstee method (0.917).
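The kind of analysis described here can be sketched as follows, although the study itself used SPSS. The sketch assumes that the correlation is a Pearson correlation between binary pass/fail classifications (equivalent to the phi coefficient); the text does not name the coefficient. The cut scores are those reported in this study, whereas the examinee scores and the function names are hypothetical.

```python
import math


def classify(scores, cut):
    """1 = pass (score at or above the cut score), 0 = fail."""
    return [1 if s >= cut else 0 for s in scores]


def pearson(x, y):
    """Pearson correlation; for two 0/1 variables this is the phi coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when everyone passes or everyone fails
    return cov / math.sqrt(var_x * var_y)


# Hypothetical examinee scores on a 100-point scale (illustration only).
scores = [55, 61, 62.1, 63, 65, 70, 72, 80]
ebel = classify(scores, 62.2)
hofstee = classify(scores, 62)
angoff = classify(scores, 71.27)

r_ebel_hofstee = pearson(ebel, hofstee)
r_ebel_angoff = pearson(ebel, angoff)
print(round(r_ebel_hofstee, 3), round(r_ebel_angoff, 3))
```

Cut scores that lie close together classify nearly the same examinees as passing, so their pass/fail vectors correlate more strongly than those of cut scores that lie far apart.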

Feedback on the standard-setting method
At the end of the standard-setting workshop, a survey was conducted of all 6 panelists. The results for the most appropriate standard-setting method for the national examination were as follows: Ebel, 57.1%; Angoff, 28.6%; Hofstee, 14.3%; and bookmark, 0%. An advisory council was held to present the workshop results and the suggested methods for standard-setting for the KRTLE, along with a discussion about these methods. Four panelists attended, discussed the issues, and decided that they agreed with the suggested standard-setting model proposed in this study and the item subsets (Fig. 1).

Suggestion of a standard-setting method
The final proposal for a standard-setting method is shown in Fig. 1. In the first step of the standard-setting method, the modified Angoff or Ebel method is used, and in the second step, the Hofstee method is used to check whether the proposed standard-setting method presents an acceptable range of cut scores and fail rates for the national examination. The Hofstee acceptable cut score and fail rate range will not be absolute, but can be used as a reference (Fig. 2).
When using the Angoff method, a modified model that uses test information to set standards will help reduce variation across panelists. For the Ebel method, the test information should be examined, and methods of utilizing the actual level of difficulty should be compared.
Although we did not attempt to do so in this study, based on the literature, we suggest that all items should be rated due to the nature of national examinations, and that items should be allocated into subsets according to test subject, test period, and item information. It is appropriate to allocate items according to item information, such as level of difficulty and discrimination.

Discussion
Standards for the KRTLE were set using the Angoff, Ebel, bookmark, and Hofstee methods. The Ebel and Hofstee methods showed the most similar results, and the cut score according to these 2 methods was also most similar to the current standard of the national examination (a score of 60). Since the cut score of the national examination is fixed, the examination committee members consider the fixed score when developing or organizing national examination items. In other words, the Ebel and Hofstee methods showed the most similar results when assuming that the items were created according to a passing score of 60. The Ebel method comprehensively takes into account the relevance of the items, the expected percentage of correct answers of the minimally competent person, and the percentage of correct answers on items with similar relevance and a similar expected percentage of correct answers by borderline examinees. Thus, the procedure is complicated, but the results were similar to the actual cut-off scores. In this study, the modified Angoff method, which refers to the information of the actual items to set the standard, was not applied. Thus, the cut-off score according to the Angoff method was different from the other cut-off scores.
The standard-setting method proposed in this study is to rate items using the modified Angoff or Ebel method in the first step and then to confirm the acceptable cut score and fail rate using the Hofstee method. The modified Angoff method, which is the most commonly used method of setting a cut score, and the Ebel method, which yielded relatively stable results in this study, can be applied to obtain the cut score. Then, the Hofstee method is used to examine whether the result is acceptable considering the maximum and minimum ranges of the cut score and fail rate. For the Qualifying Examination part II, which is a practical skill test for doctors in Canada, the cut-off score is calculated using the contrasting groups and borderline group methods, and the effect of the result is considered through the Hofstee method [16].
While all the panelists evaluated all items in the existing method, we propose the use of item subsets, a partial rating method in which panelists divide the entire set of items and rate them. In this study, partial rating with item subsets was not carried out. However, rating requires considerable time and effort, so if panelists are appropriately trained, the entire item set should be divided and then allocated to panelists. Thus, reviewing and rating only a subset of items would increase the efficiency of the panelists, while maintaining reliability. The panelists who participated in the workshop also mentioned that partial evaluation would be more effective if a sufficient discussion on common items was held. Ferdous and Plake [11] in 2007 set the standard for the 'No Child Left Behind' in the United States and asked the panelists to evaluate only some items based on a consideration of their fatigue, which could reduce reliability. This study is significant, as it applied various standard-setting methods to the KRTLE beyond the existing fixed cut score and proposed a method of combining standard-setting methods for the first time.