Quality assessment of a multiple choice test through psychometric properties

Introduction: Instruments for multiple choice question (MCQ) assessment in health sciences must be designed to ensure validity and reliability. The present paper assesses the quality of an MCQ test in Research Methodology at the Faculty of Medicine at Xochicalco University. It establishes the basis for improving the quality of MCQs and is intended as a baseline for other universities. Methodology: The peer-reviewed test had 20 MCQs, each with three distractors and a single correct response, and 89 students took the exam. The tests were graded and analyzed to estimate the difficulty index (DIF I), discrimination index (DI) and distractor efficiency (DE). Cronbach's alpha and ANOVA were calculated with SPSS. Results: The mean DIF I (0.49) indicates that the test was moderately difficult; the mean discrimination index of 0.25 means that the test discriminates only moderately between skilled and unskilled students and needs to be checked; only 20% of the items were considered excellent and 5% were considered good questions. The alpha coefficient was 0.898, considered good for an MCQ assessment. ANOVA results showed no significant differences between groups. Discussion and Conclusion: This test shows a high percentage of moderately difficult questions, is unbalanced, and a reduction of non-functional distractors (NFDs) is needed. However, the topic assessed has been learned to the same standard across the different groups of students.


Introduction
Empirically developed learning assessment tools limit the proposed educational goals and do not objectively reflect the level of achievement through the grades obtained (Ortiz-Romero et al., 2015). Since the incorporation of multiple choice questions (MCQs) for medical testing by the National Board of Medical Examiners (Hubbard, 1978), several guidelines have been created to properly construct these kinds of questions and to evaluate higher cognitive processing (Haladyna, Downing and Rodríguez, 2002). Research groups, mainly from the areas of educational psychology and educational assessment, have focused on the study of MCQ tests (Hancock, 1994; Martínez, 1999) to validate the theoretical intentions in each learning domain. MCQ tests have some advantages over other assessment strategies: they allow educators to cover a wide range of educational material in a short period of time, and they are especially useful for evaluating large populations of students. Moreover, MCQ tests allow the construction of multiple test versions to control cheating, and they make academic feedback to the students easier (Simkin and Kuechler, 2005). In addition, statistical information can easily be obtained to determine class performance on a particular question and to assess whether the question was appropriate to the context in which it was presented (Carneson et al., 2016).
Instruments for MCQ assessment in health sciences (Durante et al., 2011) must ensure validity and reliability to be scientifically sound and practical. Validity is the degree to which an instrument measures what it is supposed to measure; reliability is a statistical concept representing that the scores obtained by the students would be similar if they were assessed again, meaning that the construct measured is consistent across time. Reliability can be measured by a statistical method for internal consistency called Cronbach's alpha. It ranges from 0 to 1, where zero indicates no correlation between the items and one represents complete covariance among them. It is suggested that a Cronbach's alpha coefficient of 0.8 or more is desirable for high-stakes in-house exams, and that reliability can be improved by increasing the number of items in an exam (Ali, Carr and Ruit, 2016).
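As a concrete illustration, the coefficient can be computed from a students-by-items score matrix with the classic formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below uses invented 0/1 data, not the exam analysed in this study.

```python
# Cronbach's alpha from a students-by-items matrix of 0/1 scores.
# Data are invented for illustration; this is not the study's dataset.

def variance(values):
    """Population variance (divide by n), as in the classic alpha formula."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def cronbach_alpha(scores):
    """scores: one row per student, each row a list of 0/1 item scores.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(scores[0])                                  # number of items
    item_vars = sum(variance([row[i] for row in scores]) for i in range(k))
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Four hypothetical students answering a three-item quiz:
scores = [[1, 1, 1],
          [1, 1, 0],
          [0, 1, 0],
          [0, 0, 0]]
print(cronbach_alpha(scores))  # 0.75 for this toy matrix
```

A value near the 0.8 threshold mentioned above would indicate acceptable internal consistency; note that alpha is sensitive to test length, which is why adding items tends to raise it.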
MCQs are characterized by high validity and reliability if they are appropriately constructed (Ware and Vik, 2009). Even though medical schools must review the quality of their MCQs to ensure validity and reliability, in México only the Faculty of Medicine of the National Autonomous University of Mexico (UNAM) has published these kinds of studies (Delgado-Maldonado and Sánchez-Mendiola, 2012; Martínez et al., 2014; Saldaña, Delgadillo and Méndez, 2014; Borrego-Mora and Santana-Borrego, 2015). These studies were done to gather evidence of the validity and reliability of different assessments using different psychometric analyses. Through them, the authors were able to identify flaws in their MCQs, provide feedback to faculty programs, and improve the length and quality of their high-stakes exams.
Item analysis shows which questions are good and which need improvement or should be discarded (Mitra et al., 2009), based on the difficulty and discrimination indexes and the distractor efficiency (Sahoo and Singh, 2017). The difficulty index (DIF I) describes the percentage of students who correctly answered the item. It ranges from 0-100%: the higher the percentage, the easier the item, and vice versa (Sahoo and Singh, 2017). The recommended distribution for an ideally balanced exam is 5% easy items, 5% difficult ones, 20% moderately easy, 20% moderately difficult and 50% average (Backhoff, Larrazolo and Rosas, 2000). Despite the importance of MCQ tests, no recent publications have been found on this topic. The discrimination index (DI) expresses the ability of an item to differentiate between high and low scorers. It ranges between -1.00 and +1.00; items with a higher DI better discriminate between students of higher and lower abilities. A highly discriminating item indicates that students with high test scores answered the item correctly, whereas students with low test scores answered it incorrectly (Boopathiraj and Chellamani, 2013; Sahoo and Singh, 2017). When a distractor attracts few or no examinees, it is a poor distractor and should be reviewed (Odukoya et al., 2017). An effective distractor is one chosen by ≥5% of the students (Kaur, Singla and Mahajan, 2016; Sajitha et al., 2015; Ware and Vik, 2009). Distractor efficiency (DE) is determined for each item on the basis of the number of non-functional distractors (NFDs), an option being non-functional if it is selected by <5% of students (Kaur, Singla and Mahajan, 2016).
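The three measures described above can be sketched in a few lines of Python. The scores, the 27% upper/lower group split (a common convention for forming the high and low groups, not necessarily the exact rule used by Backhoff) and the option counts are illustrative assumptions, not data from the study.

```python
# Item-analysis sketch for DIF I, DI and NFD detection, as defined above.
# All inputs are invented for illustration.

def difficulty_index(item_scores):
    """DIF I: proportion of students who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, fraction=0.27):
    """DI: (correct in top group - correct in bottom group) / group size,
    with groups taken from the top and bottom 27% of total scores."""
    n = max(1, int(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    return (sum(item_scores[i] for i in high) -
            sum(item_scores[i] for i in low)) / n

def nonfunctional_distractors(distractor_counts, n_students, threshold=0.05):
    """NFDs: wrong options chosen by fewer than 5% of students."""
    return [opt for opt, c in distractor_counts.items()
            if c / n_students < threshold]

# One hypothetical item answered by 89 students (correct answer: 'b').
print(difficulty_index([1, 0, 1, 1]))                            # 0.75
print(nonfunctional_distractors({'a': 40, 'c': 3, 'd': 2}, 89))  # ['c', 'd']
```

In this toy example, distractors 'c' and 'd' each attract less than 5% of the 89 students, so both would be flagged as non-functional and the item would warrant distractor revision.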
Another psychometric property measured is the alpha if item deleted (Pell et al., 2010). Since Cronbach's alpha tends to increase with the number of items in an assessment, the resulting alpha when one item is deleted should be lower than the overall alpha score if that item has performed well. Where this is not the case, any of the following reasons may apply: the item measures a different construct from the rest of the items; the item is poorly designed; the assessors are not assessing to a common standard; or there are teaching issues (the topic tested has not been well taught, or has been taught to a different standard across different groups of students).
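A minimal sketch of this check, using an invented 0/1 score matrix: alpha is recomputed with each item removed, and items whose deletion keeps or raises the overall alpha are flagged for review.

```python
# Alpha-if-item-deleted: recompute Cronbach's alpha with each item removed.
# A well-performing item should yield a deleted-alpha BELOW the overall alpha.
# The 0/1 score matrix below is invented for illustration.

def cronbach_alpha(scores):
    k = len(scores[0])
    def var(vals):                       # population variance
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)
    item_vars = sum(var([row[i] for row in scores]) for i in range(k))
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

def alpha_if_item_deleted(scores, item):
    reduced = [[v for i, v in enumerate(row) if i != item] for row in scores]
    return cronbach_alpha(reduced)

scores = [[1, 1, 1, 0],
          [1, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 0, 0]]
overall = cronbach_alpha(scores)
flagged = [i for i in range(len(scores[0]))
           if alpha_if_item_deleted(scores, i) >= overall]
print(flagged)  # items whose removal would keep or raise the overall alpha
```

In this toy matrix the last two items are flagged: removing either raises alpha, which in a real exam would prompt the review discussed above (different construct, poor design, or teaching issues).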
For medical schools it is important to guarantee the quality of education and therefore the quality of the MCQs that are applied. The Faculty of Medicine at Xochicalco University in Ensenada, Baja California, México, has a group of Research Methodology teachers who have been working on peer-reviewed MCQs since 2017. The objective of this research was to assess the quality of an MCQ exam in the 2nd-semester Research Methodology course in the 2019-1 term, which represents 40% of the grade of each partial evaluation. This research will establish the basis for improving the quality of MCQs, and the results are proposed as a baseline for the scientific analysis of multiple choice tests within the faculty itself in its different areas, as well as in other medicine faculties.

Methods
To create the test, the academic program content corresponding to the Research Methodology subject was divided among the 2nd-semester lecturers, and each one provided MCQs to ensure that all the topics were covered. In order to have a consensual evaluation of the test, each item was peer-reviewed and grammatical or syntax errors were corrected. The test had 20 MCQs, each with three distractors and a single correct response. A total of 89 students took the test on the same day. There were two versions of the test with the items in a different order, covering topics such as problem statement, causality studies and protocol design. The tests were graded, and the data obtained were entered in Microsoft Excel 2016 and analyzed to calculate DIF I, DI and DE.
The formulas of Backhoff et al. (2000) were used to find DIF I and DI. The criteria used to categorize the difficulty and discrimination indexes are presented in Table 1, and the criteria used to categorize the distractor efficiency are presented in Table 3. Cronbach's alpha, the resulting alpha-if-item-deleted scores and a one-way analysis of variance (ANOVA) were calculated with the Statistical Package for the Social Sciences (SPSS) 22. To analyse the differences between the means of groups A, B and C, an ANOVA P value of <0.050 was considered statistically significant.
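For readers without SPSS, the one-way ANOVA it performs reduces to an F statistic that can be computed by hand. The sketch below is a plain-Python illustration with invented group scores, not the study's data; SPSS additionally reports the P value, which this minimal version omits.

```python
# One-way ANOVA F statistic for comparing mean scores across groups,
# a plain-Python sketch of the computation SPSS performs.
# The group scores below are invented, not the study's data.

def one_way_anova_f(groups):
    """groups: list of score lists, one per group. Returns
    F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    k, n = len(groups), len(all_scores)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((s - m) ** 2
                    for g, m in zip(groups, means) for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical groups of exam averages:
groups = [[72, 75, 70, 78], [71, 74, 69, 76], [65, 68, 66, 70]]
print(round(one_way_anova_f(groups), 2))
```

The F statistic is then compared against the F distribution with (k-1, N-k) degrees of freedom to obtain the P value; a P < 0.050 would indicate a significant difference between the group means.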
Ethics approval was granted by our medical school Ethical Committee.

Reliability Coefficient and Cronbach's Alpha if Item Deleted
The alpha score of the assessment was 0.898. The alpha-if-item-deleted values were below the overall alpha score for 75% of the items. Questions 1, 7, 11, 12 and 14 (25% of the exam) obtained a score equal to or above 0.898. Questions 2, 6, 8, 9, 13, 14, 17 and 18 had a low discrimination index, but their alpha-if-item-deleted values were below the overall alpha score. Only item 14 had both a low discrimination index and an alpha-if-item-deleted score above the overall Cronbach's alpha.

ANOVA
The averages obtained in two of the groups were passing (above 70) while the third group's average was not; however, the ANOVA did not show significant differences between the averages of the three groups (Table 6).

Discussion
According to Backhoff et al. (2000), the mean DIF I of the exam applied in the present research (0.49, S.D. = 0.18) means it was moderately difficult. Moreover, 55% of the items are classified as moderately difficult. The mean DIF I is lower compared to that obtained by Backhoff et al. (2000) for a high-stakes MCQ examination (DIF I = 0.56), and also lower compared with that of Rao et al. (2016) (DIF I = 0.75) for a pathology MCQ test.
Licona-Chávez A, Montiel Boehringer P, Velázquez-Liaño L MedEdPublish https://doi.org/10.15694/mep.2020.000091.1

The DIF I results differ from the criteria suggested by Backhoff et al. (2000) for a balanced exam, since a higher percentage of moderately difficult questions (55% vs the 20% expected) and a lower percentage of moderately easy ones (15% vs the 20% expected) were obtained. The items of average difficulty were fewer than expected (10% vs the 50% expected), while the easy items (5%) were as expected. However, the exam did not have the 5% of difficult items suggested by Backhoff. These results indicate that some questions must be reviewed in order to balance the exam according to the suggested criteria.
The mean discrimination index was 0.25 (S.D. = 0.16), which means that the test discriminates only moderately between skilled and unskilled students and needs to be checked. None of the items showed a very poor DI, which is an advantage because a negative DI value indicates the presence of ambiguous questions or questions with the answer wrongly marked; thus, none of the analyzed items were definitively discarded. According to the criteria used, 20% of the items were considered excellent and 5% were considered good questions, indicating that they discriminate among students. Nevertheless, 75% of the items could be rejected due to their poor discrimination power (<0.20), so a deeper evaluation of these items should be conducted. Other studies (Kaur, Singla and Mahajan, 2016) that used similar criteria to interpret results discarded all questions with a DI under 0.20 without further evaluation. On the other hand, seven of the eight items with poor DI were considered moderately difficult according to their DIF I, and six of them had at least one NFD. It has been observed that the number of NFDs disturbs the discrimination power of MCQs: items with fewer NFDs correlate with good or excellent DI (Rao et al., 2016; Kheyami et al., 2018).
Among the different parameters used in this study, the DI is the most accurate because it takes into account all the questions as well as all the students for the evaluation (Saldaña et al., 2014). In this study, 20% of the items were kept for subsequent use and the other 80% still required further improvement.
Even though this test has only 20 MCQs, its alpha coefficient is 0.898, which is good for an MCQ assessment. The high consistency of this first analysed exam was probably because the questions were peer-reviewed. Tavakol and Dennick (2012) suggested in AMEE Guide 66 the use of alpha-if-item-deleted for high-stakes examinations. Although Cronbach's alpha if item deleted is not commonly used to measure MCQ quality, in the present study it was helpful for evaluating individual items. The Cronbach's alpha if item deleted was equal to or above the overall alpha score for 5 items (1, 7, 11, 12 and 14), which means that if they were removed, the exam would obtain the same or a better Cronbach's alpha. Item 14 had three metrics indicating it was a flawed question: a low discrimination index, a Cronbach's alpha if item deleted above the overall score, and one NFD. There were also items with 1 or 2 NFDs whose Cronbach's alpha if item deleted was below 0.898, although it is known that careful creation and selection of distractors reduces the cueing effect and improves MCQ tests (Ali et al., 2016).
The ANOVA results show no significant differences between groups. This could indicate that, although each group has a different teacher, the learning level is similar between groups, probably due to the common academic planning and schedules among the Research Methodology teachers in our medical faculty.

Conclusion
This paper is the first psychometric analysis of an MCQ test carried out in a medical school in northwest Mexico. The MCQ exam showed a high percentage of moderately difficult questions and lacked average questions, resulting in an unbalanced test, even though the items had been peer-reviewed. Further analysis must be done in order to increase the percentage of MCQs with an average DIF I. On the other hand, the test had a low discrimination index: only 20% of the items can be kept, and the rest need deeper evaluation. To improve the discrimination power of these MCQs, a reduction of NFDs will be needed. This study also shows the advantages of the psychometric analysis of MCQ examinations, which facilitates faculty feedback to improve MCQs and distractors in order to develop a validated question bank. The ANOVA test was useful to show that, even though the teaching methods for the same topics differ, the topic has been learned to the same standard across the different groups of students.

Take Home Messages
Quality assessment of a multiple choice question (MCQ) test delivers validated questions and allows the formation of a trustworthy question bank. Multiple choice questions used for assessment are better designed when they are peer-reviewed. The measurement of the psychometric properties of MCQs indicates where their flaws lie, so that they can be improved.

Notes On Contributors
Dr Ana Livia Licona-Chávez has coordinated the medical research area since 2008 and has been teaching for 12 years.
Dr Pierangeli Kay-to-py Montiel Boehringer has been a general practitioner since 1993, has been teaching for more than 30 years and has done medical education research since 2013.
Prof Lupita Rachel Velázquez-Liaño has been teaching since 2014 and is now responsible for all the laboratories in this medical school.