How to grade items for a question bank and rank tests based on student performance

Introduction: Not utilising post-test psychometric analyses of questions and not maintaining a question bank appeared to adversely affect the quality of tests and increase the workload of academicians, who were required to write fresh questions for every examination. The literature review did not reveal any gold-standard method for creating a question bank. This study formulated criteria for recruiting multiple choice questions into a question bank and introduced a formula to rank whole tests. Methods: We collected used question papers of multiple true-false questions (MCQ) and one-best-answer questions (BAQ) and had two experienced academicians scrutinise them and identify items with flaws. The flawless items in each test were counted and the test performance index (TPi) determined. The psychometric item analysis reports of the tests were then analysed to identify bankable items, and the TPi of the tests was also calculated by this method. The TPi values derived from expert opinion were compared with those obtained by the objective criteria using the Pearson correlation coefficient and Spearman's rho. Results: Judgements by the two experts showed a positive correlation with each other, as did expert judgements against the objective formulae. Omission rates in MTF items showed a highly significant negative correlation with the difficulty index, falling short of a perfect -1, which supported including the omission index in their triple formula. The mean number of functioning distractors (FD) per BAQ item was 1.87 (SD 1.14), which supported requiring ≥2 FD per item in their triple formula. Conclusion: Expert judgement in question vetting is essential. However, objective post-test scrutiny of items, using the difficulty and discrimination indexes enhanced with the omission index and distractor efficiency, is also required to recruit items for question banks. The Test Performance Index will be a useful metric to rank tests.


Introduction
Teaching and assessing medical students' performance have always been challenging. Many assessment modules have been developed over the years to evaluate students' learning correctly (Jones, 2005). Multiple choice questions with true/false options (MTF/MCQ) and best answer questions (BAQ) have been used for many years to assess medical students, and their reliability and validity have been tested (Musa, 2017; Suryadevara and Bano, 2018). In our setting, questions are constructed afresh by the academicians every time the coordinators request them. Although item analysis reports were generated by the optical mark reader (OMR) every time MCQ and BAQ tests were scored, they were seldom utilised to evaluate the questions post-test. The absence of a question bank not only increased the workload of the academicians but also meant that the psychometric indexes available for evaluating the quality of questions and rectifying their flaws went unused. Currently, the difficulty index (DIFi) and the discrimination index (DISi) are widely used to judge the quality of MCQ and BAQ (Ebel and Frisbie, 1991; Considine, Botti and Thomas, 2005). Additional information provided by item analysis includes the rate of omission of each of the five options in MCQ and the uptake of each distractor in BAQ. DIFi indicates the proportion of test-takers who answered the item correctly, and DISi how well the item discriminated high scorers from low scorers (Ebel and Frisbie, 1991; Considine, Botti and Thomas, 2005). An omission in MTF indicates that the candidate either did not know the answer or refrained from attempting it for fear of incurring a negative score, which could be attributed to ambiguity or confusing elements in the item. This issue is important because our faculty has followed the traditional method of negative scoring for wrong answers in MCQ. Distractor efficiency is another important indicator in BAQ, which was not given sufficient importance in our setting.
Our faculty used 5-option BAQ, but having noticed the presence of many non-functioning distractors in the post-test item analysis, switched to 4-option items recently. A distractor is considered ineffective or not functioning, if it is not chosen as the answer by at least 5% of the test-takers (Nunnally and Bernstein, 1978;Burud, Nagandla and Agarwal, 2019). If a BAQ item does not achieve at least two functioning distractors (FD), it is strictly not a best answer question in effect, but more like a one correct answer question (Burud, Nagandla and Agarwal, 2019).
Expert vetting of questions before examinations is the rule. The question author, as well as the vetting committee, would pass the questions as 'perfect' in all aspects, yet post-test item analyses often reveal many subtle flaws in them. This points to the need to use post-test analysis, besides expert vetting, to recruit items for the question bank. We considered the currently used item-qualifying yardsticks of DIFi and DISi not stringent enough and planned to include the omission index (OMi) and distractor efficiency (DE) as additional metrics for qualifying items for the question bank. Accordingly, we proposed a triple formula each for MCQ and BAQ over the existing dual formulae to standardise the grading of items for the question bank, and another formula called the Test Performance Index (TPi) to rank the entire tests themselves based on the post-test analysis (Table 1).

Methods
All the materials used in this study belong to the Faculty of Medicine and Health Sciences, Universiti Malaysia Sarawak. We obtained the permission of the faculty's Dean through the Deputy Dean (Academics) to use them for this study. The materials available were: question papers, student scores and item analysis reports of 24 MCQ and 22 BAQ tests used for end of posting/block examinations, and additional item analysis reports of 6 MCQ (60 items each) and 7 BAQ tests (50 items each) used for professional examinations (Table 2, 3). A pair of experienced academicians independently scrutinised the 24 MCQ and 22 BAQ question papers for flaws. Items with flaws such as lack of clarity, ambiguity or confusing expressions in the stems and options, and those too easy or too difficult, were excluded. The TPi of each test was calculated by dividing the number of flawless questions by the total number of items in the test. The item analysis reports of the professional examinations were used to study the impact of adding the omission index and distractor efficiency in the triple formulae (A) besides the difficulty and discrimination indexes used in the dual formulae (B). DIFi, DISi and the number of items not answered (omissions) in each MCQ were generated automatically by the optical mark reader (Smart Scan) while scoring the tests. We calculated the omission index (OMi) of MCQ items using the following formula: total number of omissions in an item ÷ (5 × number of examinees), 5 being the number of options in an item. A distractor in BAQ chosen as the answer by 5% or more of the examinees was counted as an FD. We labelled items qualifying by formulae A 'grade A' and those among the remaining items qualifying by formulae B 'grade B'. We calculated the TPi of all the tests by dividing the number of qualifying items (grades A+B) by the total number of items in the test (Table 2, 3).
We compared the TPi of all the tests derived by the different methods using Pearson's correlation coefficient and Spearman's rho in SPSS version 22 to validate the formulae (Table 4, 5, 6). Fifty per cent, fixed by the faculty, was taken as the tests' passing score. The students' overall passing rate in each test was also examined for its relationship with the quality of the tests, apart from the psychometrics. A DIFi of 0.31-0.79 was taken as moderate difficulty and a DISi of ≥0.15 as fair discriminating power for all items (Ebel and Frisbie, 1991). An omission index of <0.25 and a distractor efficiency of 2 or more FDs per item were set as the additional criteria for formulae A (Table 1).
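The grading criteria and the validation step can be sketched together. This is a hedged illustration: the thresholds follow those stated above (DIFi 0.31-0.79, DISi ≥0.15, OMi <0.25 for formula A in MCQ), but the item values and TPi figures are invented, and scipy.stats is substituted here for the SPSS procedures actually used.

```python
from scipy.stats import pearsonr, spearmanr

def grade_mcq_item(difi, disi, omi):
    """Grade an MTF (MCQ) item: formula B uses DIFi and DISi alone;
    formula A additionally requires OMi < 0.25."""
    passes_b = 0.31 <= difi <= 0.79 and disi >= 0.15
    if passes_b and omi < 0.25:
        return "A"
    if passes_b:
        return "B"
    return None  # item not recruited to the bank

grades = [grade_mcq_item(*item) for item in [
    (0.55, 0.30, 0.10),  # moderate, discriminating, few omissions -> A
    (0.45, 0.20, 0.40),  # passes B but high omission rate -> B
    (0.90, 0.05, 0.02),  # too easy, poor discrimination -> rejected
]]
print(grades)  # ['A', 'B', None]

# Validation: correlate TPi from expert judgement with TPi from the
# objective criteria (hypothetical TPi values for six tests).
tpi_expert  = [0.80, 0.65, 0.72, 0.55, 0.90, 0.60]
tpi_formula = [0.75, 0.60, 0.70, 0.50, 0.85, 0.58]
r, p = pearsonr(tpi_expert, tpi_formula)
rho, p_rho = spearmanr(tpi_expert, tpi_formula)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

An analogous grading function for BAQ would replace the OMi criterion with the requirement of ≥2 functioning distractors per item.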

Results
The lower the difficulty index of the item (the more difficult the item), the higher the omission index. The lower the omission index, the higher the discrimination index. The greater the number of functioning distractors, the lower the difficulty index (the more difficult the item) and the higher the item's discrimination power (Table 4). In the MCQ papers there was a positive correlation, not statistically significant, between the TPi derived by the two experts' judgements. There was a significant correlation between the TPi derived by expert 1's judgement and those derived by the formulae, while the correlation between the TPi derived by expert 2's judgement and those calculated by the formulae was not significant. There was a negative correlation, not statistically significant, between students' passing rates and the TPi calculated by the formulae, meaning the passing rates became slightly lower as the quality of the questions improved (Table 5). In the BAQ papers there was a positive correlation, not statistically significant, between the TPi derived by the two experts' judgements. There was a statistically significant positive correlation between the TPi calculated from expert 1's judgement and those derived by the formulae, while the corresponding correlation for expert 2 was not significant. There was a statistically significant negative correlation between students' passing rates and the TPi derived by the formulae, meaning the passing rates in BAQ also became lower as the quality of the questions improved (Table 6).

Discussion
Generally, DIFi and DISi (the components of formulae B) are used to evaluate MCQ and BAQ items post-test. We introduced the omission index in MCQ and the number of FDs per item in BAQ as additional indicators and proposed the triple formulae (A) to make the recruitment of items for the question bank more stringent. We used the Test Performance Index (TPi), a direct reflection of the proportion of items recruited to the question bank, to rank the tests and used its values for all the comparisons (Table 2, 3). TPi may range from 0 to 1, zero indicating that none and one indicating that all of the items in the test were recruited. We believe that generating such an index to rank the tests by their performance in the examination would make the examination coordinators and question authors aware of the quality of their products, and thereby encourage them to take remedial measures to improve the items. The variables likely to affect the quality of items, and thereby the TPi of the tests, include the questions or the teaching not being concordant with the syllabus; the test being too short and not covering enough topics, affecting its reliability and validity (Tarrant and Ware, 2010); the options in an MCQ not being indisputably true or false; the stem in a BAQ not being conclusive; the distractors not being effective enough to distract the test-takers; and the quality of the examinees.
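Once TPi is computed, ranking the tests is straightforward. A minimal sketch, with invented test names and TPi values for illustration only:

```python
# Hypothetical TPi values for a handful of tests (not from the study's data)
tpi_by_test = {
    "Block 1 MCQ": 0.62,
    "Block 2 MCQ": 0.48,
    "Block 3 BAQ": 0.71,
    "Block 4 BAQ": 0.55,
}

# Rank tests from best to worst performance by TPi
ranking = sorted(tpi_by_test.items(), key=lambda kv: kv[1], reverse=True)
for rank, (test, tpi) in enumerate(ranking, start=1):
    print(f"{rank}. {test}: TPi = {tpi:.2f}")
```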
The inverse correlations between OMi and DIFi, as well as between OMi and DISi, in MCQ, which fell short of a perfect -1, justified the inclusion of OMi as an additional indicator of the quality of MCQ (Table 5). BAQs with more FDs would be more difficult for the examinees but higher in their discrimination power. Nonfunctioning distractors (NFD) in a BAQ not only fail to perform their function of distracting poor performers but lead them to the answer by elimination. Several NFDs in an item make it too easy and weak in discriminating power (Burud, Nagandla and Agarwal, 2019). Therefore, we used ≥2 FD per item as the third indicator for formula A in BAQ (Table 6).
The TPi of MCQ and BAQ by the experts' judgements, although correlated positively with each other and with the TPi by the formulae, were not consistent in their relationship (Table 2, 3, 5, 6). This indicated subjectivity in the experts' judgement and pointed to the need for an index with objective criteria based on the psychometrics of the test. The TPi of both MCQ and BAQ tests showed a negative correlation of no statistical significance with students' passing rates, indicating that students' passing rates could reflect both the difficulty of the test and the quality of the examinees. The literature has shown that students' scores cannot be taken as a good indicator of the quality of assessments (Grissmer, 2000); therefore, we did not use them as an indicator in this study. Our study pointed to the fact that question vetting by content experts, even if rigorously implemented, is not infallible in eliminating flaws in the items. How the examinees read and interpret the items can differ from the views of the experts. When the items and their individual options are scrutinised again in the light of the item analysis, flaws often show up (Hingorjo and Jaleel, 2012). All our findings highlighted the importance of post-test analysis of the items and the need to grade the items and tests objectively in order to maintain a question bank and eliminate flaws for future use. Our study has the limitation of being confined to a single faculty and of the formulae being newly introduced. Their utility and validation will be enhanced when other faculties also try the question bank formulae and experience their effectiveness.

Conclusion
In the absence of existing gold-standard formulae to grade items, recruit them for a question bank and rank MCQ and BAQ tests, this study has introduced and evaluated formulae (A, B and TPi) for this purpose and created a flowchart for ready reference (Supplementary File 1). Maintaining a question bank is expected to improve the quality of tests as well as ease the work of question authors. Scrutinising and modifying items in the light of item analysis is an essential step to improve the tests, although expert judgement and multidisciplinary vetting are of irreplaceable value. TPi will be a valuable metric to rank and compare the tests. It is expected that more studies along these lines will follow to endorse these formulae and strengthen their validity.

Take Home Messages
Maintaining a question bank will improve the quality of tests and reduce the workload of academicians.
Utilising the post-test item analysis to grade questions and to improve them is worth the effort.
Calculating the Test Performance Index of MCQ tests to rank them will make examination coordinators and question authors aware of the quality of the tests.
The omission rate in MCQ T/F items with negative scoring is an additional indicator of their quality.
Distractor efficiency of BAQ items is a valuable indicator of item quality worth utilising.

Notes On Contributors
Dr. Thomas Puthiaparampil is a professor of medicine and has been coordinator of the year-3 medicine clinical posting for more than a decade at the Universiti Malaysia Sarawak. He has been involved in question construction, vetting and post-test item analysis throughout those years, and is also actively involved in the Medical Education Unit of the medical faculty. ORCID ID: https://orcid.org/0000-0002-7939-5185

Dr. Md Mizanur Rahman is a professor of community medicine and public health. He is the coordinator of Biostatistics and Public Health Research and has been working for more than ten years at the Universiti Malaysia Sarawak. He is mostly involved in teaching and research, especially on social and public health issues. ORCID ID: https://orcid.org/0000-0002-2353-6823

Dr. Henry Rantai Gudum is a professor of haematology and a clinical coordinator for the Blood and Immunology block of the phase 1 medicine programme. ORCID ID: https://orcid.org/0000-0002-4028-4584

Dr. Imam Bux Brohi is an Associate Professor in the Department of Family Medicine, has been the coordinator for year-4 medical students for 3 years, and has been involved in the creation and vetting of questions during this period. ORCID ID: http://orcid.org/0000-0001-7574-2154

Dr. Isabel Fong Lim is a senior lecturer in the field of Medical Microbiology, a member of the Medical Education Unit, Subject Editor of the 'Trends in Undergraduate Research' journal, and Phase 1 (Preclinical Year) Academic Advisor Coordinator at the Faculty of Medicine and Health Sciences, Universiti Malaysia Sarawak. ORCID ID: https://orcid.org/0000-0001-6731-9479

Dr. Rosalia Saimon is a senior lecturer in the Department of Community Medicine and Public Health, coordinator of the Family Health posting of undergraduate medical students, and actively involved in the Medical Education Unit of the medical faculty.