Test Equating of the Medical Licensing Examination in 2003 and 2004 Based on the Item Response Theory

The passing rate of the Medical Licensing Examination has varied from year to year, probably because of differences in item difficulty and/or in the ability level of examinees. We tried to explain the origin of the difference using test equating based on item response theory. There were 500 items and 3,647 examinees in 2003, and 550 items and 3,879 examinees in 2004. The common-item nonequivalent groups design was used with 30 common items. Item and ability parameters were estimated with the three-parameter logistic model using ICL. Scale transformation and true score equating were carried out using ST and PIE. The mean difficulty index in 2003 was -0.957 (SD 2.628) and that in 2004 after equating was -1.456 (SD 3.399). The mean discrimination index in 2003 was 0.487 (SD 0.242) and that in 2004 was 0.363 (SD 0.193). The mean ability parameter in 2003 was 0.00617 (SD 0.96605) and that in 2004 was 0.94636 (SD 1.32960). The difference between the equated true scores at the same ability level was large in the score range of 200-350. The passing rates differed over the two consecutive years because the Examination in 2004 was easier and the abilities of the examinees in 2004 were higher. In addition, the passing rates of examinees with scores of 270-294 in 2003, and of those with scores of 322-343 in 2004, were affected by the examination year.

difference in the item difficulty or by the difference in the abilities of examinees. To answer this question, we applied test equating. Equating is the statistical process used to adjust scores on test forms so that the scores can be used interchangeably [1]. Equating adjusts for differences in difficulty among forms that are built to be similar in difficulty and content. There are several methods of equating two or more tests. Generally, five steps are suggested. First, choose a data collection design. There are two classes of equating designs. The first class contains the single group design and the randomly equivalent groups design. The second class contains designs for which the assumption of randomly equivalent groups may not hold. There are three nonequivalent groups designs: the common-item (anchor item) nonequivalent groups design, the pre-equating nonequivalent groups design, and the post-equating nonequivalent groups design. Second, estimate the item parameters, such as the difficulty index and the discrimination index, based on classical test theory or item response theory. Third, select common items with the same content and the same format from the two tests. Fourth, calculate the equating constant by scale transformation. This places the item and ability parameters of the two tests on a common scale.
To compute the equating constant based on item response theory, there are the regression method, the mean and sigma method, the robust mean and sigma method, and the item characteristic curve method [2]. Fifth, score equating is carried out using the ability parameters of the two tests on the common scale. True score equating and observed score equating are available for this purpose. If only ability parameters are compared, this last step is not necessary [1].
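For reference, the linear scale transformation that underlies the fourth step can be written as below. This is the standard form found in the IRT equating literature rather than a formula quoted from this paper. The slope A, the intercept B, and the subscripts "new" and "base" are notation assumed here; "base" denotes the scale onto which both tests are placed.

```latex
\theta_{\mathrm{base}} = A\,\theta_{\mathrm{new}} + B, \qquad
b_{\mathrm{base}} = A\,b_{\mathrm{new}} + B, \qquad
a_{\mathrm{base}} = \frac{a_{\mathrm{new}}}{A}, \qquad
c_{\mathrm{base}} = c_{\mathrm{new}}
```

Here a, b, and c are the discrimination, difficulty, and guessing parameters, and θ is the ability parameter. Under these relations the product a(θ - b) is unchanged, so the probability of a correct response is the same before and after the transformation.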
Following the above procedure, we applied test equating to the Medical Licensing Examination for 2003 and 2004 based on item response theory, which is less affected by the ability of examinees. Specifically, we aimed, first, to compare the item parameters after equating; second, to compare the ability parameters after equating; and third, to compare the true scores after score equating. Besides identifying the origin of the difference in passing rates, this work is meaningful both in suggesting a method for comparing the yearly results of the Examination and in providing a basis for solving the problem of the passing rate fluctuating every year. The examinees and item contents of the two Examinations are not identical. Although there were no identical items on the two Examinations, they can be regarded as alternate forms since the subjects are the same and the item difficulty index points to an equal level. Common items must be assumed in order to equate the two Examinations. Therefore, the common-item nonequivalent groups design was applied.

2) Estimation of item and ability parameters based on item response theory
The Item Response Theory Command Language (ICL) was used for the estimation of item parameters and ability parameters based on item response theory [3].
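As a minimal sketch of the model whose parameters are estimated in this step, the item response function of the three-parameter logistic model can be written as follows. The function name `p_correct_3pl` and the default scaling constant `D=1.7` are assumptions made for illustration; they are not taken from ICL or from this paper.

```python
import math


def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability
    a: discrimination, b: difficulty, c: lower asymptote (guessing)
    D: scaling constant (assumed 1.7 here; use 1.0 for the pure logistic metric)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

When ability equals the item difficulty, the probability lies halfway between the guessing parameter and 1, which is a convenient check on any implementation.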

3) Selection of common items of two tests
Items were considered common when their content and knowledge level (recall, interpretation, and problem-solving) were identical. Among these, the items whose difficulty parameters differed by less than 1 between the two years were finally selected. The number of common items was 30.
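The second part of this selection rule can be sketched as a simple filter. The sketch assumes that experts have already matched candidate pairs on content and knowledge level; the function name, the tuple layout, and all identifiers and values are hypothetical.

```python
def select_common_items(pairs, max_abs_diff=1.0):
    """Keep matched pairs whose difficulty estimates differ by less than max_abs_diff.

    pairs: list of (item_id_2003, b_2003, item_id_2004, b_2004) tuples
    that were already matched on content and knowledge level.
    """
    return [p for p in pairs if abs(p[1] - p[3]) < max_abs_diff]


# Hypothetical matched pairs (identifiers and values are made up).
candidates = [
    ("A017", -0.80, "B102", -0.35),  # difference 0.45 -> kept
    ("A233", 0.10, "B410", 1.60),    # difference 1.50 -> dropped
    ("A301", -2.20, "B055", -1.25),  # difference 0.95 -> kept
]
common = select_common_items(candidates)
```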

4) Computing equating constant by scale transformation
We computed the equating constant by scale transformation using ST, a computer program for IRT scale transformation [4]. Two categories of techniques for computing the scale transformation functions are implemented in ST: 1) techniques based on the mean and standard deviation of the item parameters; and 2) techniques based on minimizing a loss function involving item characteristic curves.
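To illustrate the first category of techniques, the mean and sigma method can be sketched as below. This is not the ST program itself, and the paper does not state which technique produced the reported values. The function name, the variable names, and the difficulty values are assumptions for illustration.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Return slope A and intercept B mapping the new scale onto the base scale.

    b_new and b_base are difficulty estimates of the same common items,
    listed in the same order, from the two separate calibrations.
    """
    if len(b_new) != len(b_base) or len(b_new) < 2:
        raise ValueError("need the same common items (at least 2) on both forms")
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B


# Hypothetical difficulty estimates for three common items (not from the paper).
b_2004 = [-1.0, 0.0, 1.0]  # new form, on its own scale
b_2003 = [-1.5, 0.5, 2.5]  # base form
A, B = mean_sigma_constants(b_2004, b_2003)
b_2004_on_2003_scale = [A * b + B for b in b_2004]
```

Techniques in the second category instead choose A and B to minimize a loss built from the item characteristic curves of the common items, which uses all item parameters rather than the difficulties alone.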

5) True score equating
We used PIE, a computer program for IRT equating, to carry out true score equating, which is easier to compute and does not depend on ability parameters [5]. Observed score equating is another available method. Since it was reported that the two methods produce very similar results in a study using the common-item nonequivalent groups design with the SAT, we used only true score equating [1].

Test equating through common items makes it possible to compare test scores across years, which otherwise cannot be compared directly. It also provides data on the validity of the passing criterion, which is inevitably affected by item difficulty and the ability of examinees. We searched PubMed (http://pubmed.org), Web of Science (http://isi01.isiknowledge.com/) and Google (http://google.com) to ascertain whether any equating data on high-stakes examinations comparable to those in this paper exist. No data on the Medical Licensing Examination could be found. Only the Medical College Admission Test (MCAT) had been equated using several methods, in a report stating that item response theory is useful for the MCAT [6]. The lack of data on this topic may be due to the difficulty of selecting common items, since not every institute wants to reuse exposed items.
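The true score equating step described in 5) can be sketched as follows. This is not the PIE program: it is a minimal illustration that assumes both forms are already on a common scale, uses simple bisection to invert the test characteristic curve, and takes D = 1.7 as the scaling constant. All function names and item parameters are hypothetical.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct (true) score."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(tau, items, lo=-10.0, hi=10.0, tol=1e-8):
    """Find the ability whose true score on `items` equals tau, by bisection."""
    floor = sum(c for _, _, c in items)
    if not floor < tau < len(items):
        raise ValueError("true score must lie between sum of c and number of items")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate_true_score(tau_new, items_new, items_base):
    """Map a true score on the new form to the equivalent true score on the base form."""
    theta = theta_for_true_score(tau_new, items_new)
    return tcc(theta, items_base)


# Hypothetical (a, b, c) parameters on a common scale.
form_new = [(1.0, 0.0, 0.2), (0.8, -1.0, 0.25), (1.2, 0.5, 0.2)]
form_base_easier = [(a, b - 1.0, c) for a, b, c in form_new]
```

The search is restricted to true scores above the sum of the guessing parameters, because the three-parameter model cannot produce a lower true score. Scores outside the range reached on the interval from -10 to 10 are not handled by this sketch.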

Change of item parameters after scale transformation
This work has some limitations. First, completely identical common items were not available, so professionals determined the common items based on content and knowledge level. Second, the number of common items is small. Thirty is the minimum number of common items when the total number of items is over 150. Since the numbers of items are 500 and 550, 30 common items is the minimum for test equating. Therefore, bias may arise from the characteristics of the common items. In this situation, it is reasonable to call them virtual common items rather than common items. No previous work on virtual common items like ours is available. In real situations where completely identical common items are not available, the introduction of virtual common items and the analysis of its results remain a task to be solved.
In a criterion-referenced test, a stable difficulty level of the test is essential. However, if the fluctuation of the passing rate originates from differences in the abilities of examinees, it is not plausible to say that the difficulty level was not properly set. This equating procedure might be invaluable for comparing the results of the Examinations even though they are alternate forms. If scores can be reported after scale transformation and an equivalent passing score is set, the validity of the test can be established more reasonably. The results obtained in this study can serve as a basis for a computerized adaptive test by supplying the item parameters of each year's Examination. Test equating can also be applied to elective tests or objective structured clinical examinations. Further work toward more stable test equating of the Medical Licensing Examination should continue the yearly comparison of item parameters and ability parameters and compare other equating methods, such as equipercentile or observed score equating.