Item Analysis of Multiple-choice Questions (MCQs): Assessment Tool For Quality Assurance Measures

Background: Integrating assessment with education is vital and ought to be performed regularly to enhance learning. There are many assessment methods, such as Multiple-choice Questions (MCQs), the Objective Structured Clinical Examination, and the Objective Structured Practical Examination. The selection of the appropriate method is based on the curricular blueprint and the target competencies. Although MCQs have the capacity to test students’ higher cognition, critical appraisal, problem-solving, and data interpretation, and to cover curricular content in a short time, there are constraints in their analysis. The authors aim to highlight some important points about psychometric analysis, describing its roles, assessing its validity and reliability in discriminating between examinees’ performance, and offering some guidance to faculty members when constructing their exam question banks. Methods: Databases such as Google Scholar and PubMed were searched for freely accessible English articles published since 2010. Synonyms and keywords were used in the search. First, the abstracts of the articles were read to select suitable matches; then the full articles were perused and summarized. Finally, the relevant data were recapitulated to the best of the authors’ knowledge. Results: The searched articles showed the capacity of MCQ item analysis to assess questions’ validity and reliability, to discriminate between examinees’ performance, and to correct technical flaws for question bank construction. Conclusion: Item analysis is a statistical tool used to assess students’ performance on a test, identify underperforming items, and determine the root causes of this underperformance for improvement, to ensure effective and accurate judgment of students’ competency.


Introduction
A Single or One Best Answer Multiple-choice Question (MCQ) is an item consisting of a stem with several options, generally three to five, one of which is the right option while the rest are distractors. This form of assessment is used in many institutions because of its capability to appraise curricula. It is an efficient and relevant tool for identifying strengths and weaknesses in student knowledge and for reflecting on educational methods and strategies; however, developing a high-quality item needs time, effort, and skill [1].
A well-built MCQ assesses the higher cognitive levels of Bloom's taxonomy, such as data interpretation, synthesis, and knowledge application, rather than testing recall of facts alone.
The stem of an MCQ is a clinical case scenario that can adequately measure core competencies and the intended learning outcomes (ILOs), evaluate students' abilities, give reliable feedback, and inform curricular reform [1][2][3]. Bloom's taxonomy arranges the cognitive domain into six hierarchical levels: knowledge, comprehension, application, analysis, synthesis, and evaluation. Tarran simplifies Bloom's taxonomy into two levels: K1 represents basic knowledge and comprehension; K2 encompasses application and analysis [4].
Item analysis is a simple and useful approach for assessing the reliability and validity of test items, performed after the exam. It examines the effectiveness of the question stem and its distractors, enabling examiners to reconstruct, modify, or delete questions before an exam bank is created for future tests [1][2][3][4]. Item analysis shows each question's difficulty index (DIF-I). It also assesses the question's capability to discriminate between the performance of good and poor students in the test, that is, the discrimination index (DIS-I) [1][2][3][4][5].
Good MCQs assess the cognitive, affective, and psychomotor domains better than other assessment methods because of their objectivity, coverage of many subjects, minimization of assessor bias, comparability, reliability, and ease of scoring [3][4][5]. In addition, they are a relevant method for measuring weaknesses or strengths in the examinee's knowledge and gaps in the teaching methods or strategies of the institute, for better graduate outcomes. They also give staff members a good opportunity to build the MCQ construction skills needed for clear exam questions [2]. The characteristics of the standard-setting tool can influence its credibility. MCQ designers ought to pay attention to the purpose of the examination and its content, based on the examinee level, the blueprint, and the minimum pass level (MPL). The tool should fit the purpose and rest on consensus judgment with practical implementation, so meticulous evaluation is advised. Maintaining standards in medical schools is crucial for the educational excellence, patient safety, and total quality management needed by both long-established and newly established colleges [5][6][7].
The authors' aim in this review was to highlight some important points about psychometric analysis, describing its roles in evaluating MCQs, assessing its validity and reliability in discriminating between examinees' performance, and offering some guidance to faculty members, especially juniors, when constructing their exam question banks.

Materials and Methods
Databases such as Google Scholar and PubMed were searched for freely accessible English articles published since 2010. Synonyms and keywords were also used in the search. The abstracts of the articles were first read to select suitable matches, and then the full-text articles were perused and summarized. Finally, the relevant data were recapitulated to the best of the authors' knowledge.

Results
In any educational institute, assessment is a way to measure the expected mastery of ILOs. It is particularly important for graduates of clinical colleges, for safe patient care and community needs. Hence, meticulous evaluation and education must be performed. Standardized assessment of students' performance involves measurement aspects that are specific to the statistical framework. This process consists of distinct phases, from the definition of the measurement objectives to the development of proper assessment tools and the analysis of the results in terms of students' achievement [5].
The assessment should match students' ability, and its items should relate to specific content domains. The development of a proper assessment method is a rather complex process that starts with the definition of item specifications and ends with the validation of the assessment method itself. Validation checks how effectively the test measures the target competencies, its content and format constraints, distractor plausibility, item difficulty, and test consistency. For this purpose, a pretest sample is first given to examinees; their responses are then analyzed and validated using psychometric methods before the final exam is conducted [6].

Discussion
Item analysis is a reliable and useful method for examining the reliability and validity of pretested standardized examination items. It is conducted after the exam and before questions are banked for future tests [5,6].

Methods of item analysis
Different methods can be used to investigate the psychometric properties of tests and test items. Descriptive methods based on Classical Test Theory (CTT) and models belonging to modern Item Response Theory (IRT) were reviewed. At the item level, the CTT model is a relatively simple methodology. It provides an estimate of the examinees' success rate on each item. CTT appraises reliability, difficulty, DIS-I, and distractor efficiency (DE) to check the appropriateness and plausibility of all distractors. The core of this theory is that an observed test score is a function of the true score and random measurement error. On the other hand, the Rasch technique of IRT is more firmly grounded for assessing the examinee's success at the item level [7]. Besides appraising test reliability, DI, and DE, IRT assesses the global rating of the exam, similar to Cronbach's alpha. Additionally, it checks exam invariance, which is essential for building exam banks with well-calibrated questions. Item standardization can be classified as follows [5][6][7][8]:
1. Relative approaches (norm-referenced): used for ranking examinees when a predetermined ranking of the examinees is wanted, so that there is no fixed MPL and the level fluctuates in accordance with the examinees' overall performance.
There are two types of Angoff method, the original and the modified, both of which are used to decide the cut-off scores for exam items. The original method needs a panel of subject experts to decide the probability that a minimally competent student can answer each item correctly. Each expert estimates a probability ranging from 0 to 1 for every question, and the average probability is then calculated as the final cut-off score. The modified Angoff method needs test-domain expertise and offers eight probability choices: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, or "do not know" [9].
DOI 10.18502/sjms.v16i3.9695
The Angoff method is a predetermined, criterion-referenced, test-centered method.
The modified Angoff method allows the panel members to discuss the cut-off score and the rating results. For this reason, it is used for licensure and professional certification tests. Since standard-setting is a decision-making process, the validity of the criterion setting and the consistency of the ratings are evaluated by how well the process is performed in accordance with the test principles. The evaluation of standard-setting validity is influenced by internal and external issues. It is important to ascertain that all standard-setting activities and measures are carried out consistently [10].
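The original Angoff calculation described above can be illustrated with a minimal Python sketch. The function name `angoff_cutoff` and the ratings are hypothetical, invented only for illustration; the sketch assumes that every expert rates every item and that any "do not know" ratings have already been excluded.

```python
from statistics import mean


def angoff_cutoff(ratings):
    """ratings[j][i] is the probability (0 to 1) that expert j assigns to item i.

    Returns the expected raw score of a minimally competent examinee and
    the same value as an average probability across items.
    """
    n_items = len(ratings[0])
    # Average the experts' probabilities item by item.
    item_means = [mean(expert[i] for expert in ratings) for i in range(n_items)]
    raw_cutoff = sum(item_means)
    return raw_cutoff, raw_cutoff / n_items


# Hypothetical ratings: three experts, four items.
example_ratings = [
    [0.6, 0.8, 0.4, 0.7],
    [0.5, 0.9, 0.5, 0.6],
    [0.7, 0.7, 0.3, 0.8],
]
raw, average_probability = angoff_cutoff(example_ratings)
print(f"raw cut-off: {raw:.2f} items, average probability: {average_probability:.3f}")
```

Reporting both the raw score and the average probability makes it easier to compare exams with different numbers of items.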
In the Nedelsky method, three Subject Matter Experts (SMEs) set the standard for MCQs by assessing the probability that a borderline (minimally competent) student will rule out the incorrect options, or distractors. The probability for an item is calculated as the reciprocal of the number of remaining options, that is, those the borderline student cannot identify as correct or incorrect. For example, if a group of experts judges that borderline students are expected to rule out two distractors in a four-option item, two options remain and the rating is 1/2 = 0.50. The cut-off score for the exam is determined by adding up the average Nedelsky values for each item [10,11].
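As a minimal sketch of the Nedelsky arithmetic, the following Python code reproduces the four-option example and sums the item averages into a cut-off score. The function names and the expert judgments are hypothetical, chosen only for illustration.

```python
from statistics import mean


def nedelsky_item_value(n_options, n_ruled_out):
    """Reciprocal of the options left after the borderline student rules some out."""
    remaining = n_options - n_ruled_out
    if remaining < 1:
        raise ValueError("at least the correct option must remain")
    return 1 / remaining


def nedelsky_cutoff(n_options, ruled_out_by_expert):
    """n_options[i] is the number of options of item i.

    ruled_out_by_expert[j][i] is the number of distractors that expert j
    expects a borderline student to rule out on item i.
    """
    item_averages = [
        mean(nedelsky_item_value(n_options[i], expert[i]) for expert in ruled_out_by_expert)
        for i in range(len(n_options))
    ]
    # The exam cut-off is the sum of the per-item average values.
    return sum(item_averages)


# The example from the text: two distractors ruled out of four options.
print(nedelsky_item_value(4, 2))

# Hypothetical judgments: three experts, three four-option items.
judgments = [[2, 1, 3], [2, 0, 3], [2, 1, 2]]
print(round(nedelsky_cutoff([4, 4, 4], judgments), 3))
```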
The Ebel method needs subject experts to judge the difficulty and relevance of each item in the exam. The panel examines each item to determine its appropriateness, difficulty or simplicity, relevance, importance, and acceptability. Each item is categorized according to its difficulty and relevance level. Next, the panel experts assess the expected chance that a minimally competent student can rule out the distractors of the items in each category.
Lastly, the number of items in each category is multiplied by the expected probability of a correct answer for that category, and the results are summed to give the exam cut-off score.
This method is relatively costly and time-consuming, and it needs many expert standard setters. Digital software is important for gathering the responses. It needs to be backed up by a criterion-referenced method such as borderline regression. It is widely used in high-stakes exams and, if challenged, can hold up in court [12].
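The Ebel calculation described above (items per category multiplied by the expected probability, then summed) can be sketched as follows. The category labels, item counts, and probabilities are hypothetical and serve only to show the arithmetic.

```python
def ebel_cutoff(categories):
    """categories maps (relevance, difficulty) to (number of items, expected probability).

    Returns the raw cut-off score and the cut-off as a proportion of all items.
    """
    raw_cutoff = sum(n_items * prob for n_items, prob in categories.values())
    total_items = sum(n_items for n_items, _ in categories.values())
    return raw_cutoff, raw_cutoff / total_items


# Hypothetical panel judgments for a 30-item exam.
panel = {
    ("essential", "easy"): (10, 0.9),
    ("essential", "hard"): (5, 0.6),
    ("important", "easy"): (8, 0.7),
    ("important", "hard"): (7, 0.4),
}
raw, proportion = ebel_cutoff(panel)
print(f"cut-off: {raw:.1f} of 30 items ({proportion:.0%})")
```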
The eclectic Hofstee method was developed in 1983 to address problems arising from disagreement between the predictions of criterion- and norm-referenced methods. In this method, the standard setters answer four questions about the candidates who will sit the test. Two of these questions concern the acceptable knowledge level (referred to as k), while the other two concern the failure rate (referred to as f): (1) What is the maximum satisfactory cut-off score, even if all of the examinees exceed it? (2) What is the minimum acceptable cut-off score, even if none of the examinees achieves it? (3) What is the maximum allowed failure rate? (4) What is the minimum acceptable failure rate? The two failure-rate questions are answered between 0 and 100%; values closer to 100% indicate a test that is difficult for anyone to pass. The two cut-off questions are answered between zero and the total number of test items; the higher the value, the more demanding the cut-off score [12].
The selection of a suitable psychometric approach is influenced by different factors and varies with the intended goals or objectives. In a low-resource setting, the CTT psychometric method may be good enough. In a high-stakes exam, IRT and Rasch Measurement Theory must be used, and the final decisions will depend on the quantitative and qualitative item results. A suitable method can be selected according to the psychometric properties wanted, such as reliability, validity, suitability of item response, scaling assumptions, and acceptability [13].

Reliability
Reliability is a concept embedded within CTT; it assesses the internal consistency of MCQ items [13,14]. Reliability and validity are important for establishing whether the results obtained meet the requirements and for measuring bias. Reliability shows the level to which the assessments are consistent, while validity reflects the accuracy of the assessment [15]. Reliability-related concepts are internal consistency, stability, equivalence, and precision. Reliability is related to both the standard error of measurement and the standard deviation of the examinees' scores. Internal consistency is estimated from the average inter-item correlation of a test, and it indicates the degree to which the MCQs measure characteristics of the same knowledge domain.
Typically, internal consistency is obtained by calculating the reliability coefficient. A reliability coefficient estimates the concordance between the observed and true scores of the examinees; it appraises the correlation between scores obtained from two parallel exams. This estimate indicates how much an individual's scores are expected to change when retested with the same or an equivalent test, without any change in knowledge or perception [14][15][16]. Increasing the number of items in an exam can raise reliability, but it is expensive and needs time and effort. A Cronbach's alpha of 0.8 or more is needed for high-stakes exams; however, licensure and other high-stakes exams usually have a fixed number of items, so an alternative is to increase the spread of the exam scores, that is, the test variance. This can be done by using items of moderate difficulty (DI: 0.4-0.8) and sufficient discrimination (point-biserial correlation, RPB, of 0.2 or more), which increase the standard deviation and variance of the scores [23][24][25]. Because any test score can contain error, the standard error of measurement (SEM) is used to estimate the interval within which the true score lies. When the SEM is small, the interval is narrower and the estimate more precise.
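To make the link between the reliability coefficient and the SEM concrete, the following Python sketch computes Cronbach's alpha from an item-score matrix and then an interval around an observed score. It relies on assumptions not stated in the text: the usual CTT relation SEM = SD × sqrt(1 - reliability), population variances throughout, and a 95% interval of ±1.96 SEM under normally distributed errors. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import pstdev, pvariance


def cronbach_alpha(scores):
    """scores[e][i] is the score of examinee e on item i."""
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(sd, reliability):
    """Assumed CTT relation: SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


# Hypothetical responses: six examinees, four items scored 0 or 1.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
alpha = cronbach_alpha(responses)
sd_total = pstdev([sum(row) for row in responses])
sem = standard_error_of_measurement(sd_total, alpha)

observed = 3
# Assumed 95% interval for the true score around one observed score.
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"alpha={alpha:.3f}, SEM={sem:.3f}, interval=({low:.2f}, {high:.2f})")
```

A smaller SEM shrinks the interval, which is the sense in which the estimate becomes more precise.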
The Kuder-Richardson Formula 20 (KR-20) measures the internal consistency and reliability of an examination. It measures the internal uniformity of an exam with many options. A KR-20 > 0.90 indicates a homogeneous test. A KR-20 of 0.8 is acceptable, but a value <0.8 is not reliable [17].
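The text does not give the KR-20 formula; a minimal sketch using the commonly stated form, KR-20 = [k/(k - 1)] × [1 - Σ(p × q)/variance of total scores], is shown below. Here k is the number of items, p is the proportion of examinees answering an item correctly, and q = 1 - p. The use of the population variance and the sample data are assumptions made for illustration.

```python
from statistics import pvariance


def kr20(responses):
    """responses[e][i] is 1 if examinee e answered item i correctly, else 0."""
    n_examinees = len(responses)
    k = len(responses[0])
    # Proportion of correct answers for each item.
    p = [sum(row[i] for row in responses) / n_examinees for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    # Population variance of the examinees' total scores (an assumption).
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical responses: six examinees, four dichotomous items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(f"KR-20 = {kr20(responses):.3f}")
```

With only four items, a value well below 0.8 is unsurprising; real exams contain many more items.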

Statistical steps of item analysis
Software such as the Statistical Analysis System (SAS) and the Statistical Package for the Social Sciences (SPSS) is used in data analysis. After the exam is conducted, data are collected manually or electronically and then entered into a Microsoft Excel sheet, SPSS, or any other statistical software of choice. Next, the data are analyzed to obtain the mean, standard deviation (SD), unpaired t-test, coefficient of variation, DIF-I, DIS-I, and DE [18].
1. Difficulty index (DIF-I) describes how hard an item is, expressed as the percentage of examinees who answer it correctly, so a lower value indicates a more difficult item. To calculate it, rank the examinees by total score, then take the top one-third as the high achievers and the bottom one-third as the low achievers, and count how many in each group answered the item correctly (HA and LA, respectively).
It can be calculated using the following formula: DIF-I = [(HA + LA)/N] × 100, where N is the total number of students in the two groups.
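The DIF-I steps above can be sketched in Python as follows. The helper names and the sample data are hypothetical; the sketch assumes that HA and LA are the numbers of correct answers in the top and bottom thirds and that ties in total score are broken by input order.

```python
def count_correct_by_group(total_scores, item_correct):
    """Rank examinees by total score and count correct answers in each third.

    total_scores[e] is examinee e's total exam score; item_correct[e] is 1
    if that examinee answered the item correctly, else 0.
    Returns (HA, LA, N), where N is the size of both groups combined.
    """
    ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0], reverse=True)
    third = len(ranked) // 3
    if third == 0:
        raise ValueError("need at least three examinees")
    ha = sum(correct for _, correct in ranked[:third])
    la = sum(correct for _, correct in ranked[-third:])
    return ha, la, 2 * third


def difficulty_index(ha, la, n):
    """DIF-I = [(HA + LA) / N] x 100."""
    return 100 * (ha + la) / n


# Hypothetical data for nine examinees on one item.
totals = [9, 8, 8, 7, 6, 5, 4, 3, 2]
item = [1, 1, 1, 1, 0, 1, 0, 1, 0]
ha, la, n = count_correct_by_group(totals, item)
print(f"HA={ha}, LA={la}, N={n}, DIF-I={difficulty_index(ha, la, n):.1f}%")
```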
A distractor is classed as a functional distractor (FD) when it is chosen by ≥5% of the examinees and as a non-functional distractor (NFD) when it is an incorrect option chosen by <5% of the examinees. Implausible distractors are easily noticed, so they ought to be modified or rejected [18][19][20][21][22][23].

Item flaws
Flaws in item writing can also influence overall performance by making questions too difficult or too easy.
Examples include the use of absolute terms such as "always" or "never" and writing the right option as a lengthy sentence. It is also wise to refrain from using wording such as "none of the above" or "except".
Grammatical flaws may cue the examinee to the right answer and make the questions easy. Items with many NFDs reduce the DE and DIS-I [24][25][26].

The number of item options
Some authors argue that MCQs with three options need much less time to construct and have a greater chance of high reliability and validity than those with four or five options. Others say that MCQs can have three or even two options and give the same results as four or five options without affecting examination quality [27][28][29].
As noted earlier, evaluation is essential not only for producing competent graduates but also for college enhancement and quality assurance [30][31][32][33]. Valid evaluation techniques aligned with the requirements of accrediting authorities are among the requirements for excellence and accreditation. This raises the importance of establishing an assessment unit to lead all evaluation activities within the institute.

Conclusion
Item analysis of MCQs is a statistical tool used to assess students' performance on a test, identify underperforming items, and determine the root causes of this underperformance for improvement, in order to ensure effective and accurate judgment of students' competency. It is a potent tool for appraising the ILOs in a short time, detecting gaps in curriculum content evidenced by students' poor performance in a test, and identifying strengths and weaknesses in teaching strategies and methods. Exam reliability and validity are important for establishing whether the results obtained meet the requirements and for measuring bias.
Training and retraining of all faculty members are important to improve their skills in constructing properly standardized MCQs and to overcome assessment challenges.

2. Absolute approaches (criterion-referenced): judgment is based on: (a) Exam content: used in high-stakes conditions such as licensure, e.g., the Angoff (1971), Nedelsky (1954), and Ebel (1972) methods, where the standard setters decide the criteria for the borderline examinee. (b) Compromise: the best-known method is Hofstee, which can be used in a low-resource setting. The designers decide the MPL after consensus. All of the above techniques should be executed before conducting the exam [5][6][7][8].

2. Discrimination index (DIS-I) is defined as the ability of an item to differentiate between students with high and low exam scores. It ranges from -1.00 to +1.00, and items with high values are good discriminators. A negative DIS-I is obtained if the low achievers give more correct answers than the high achievers. DIS-I can be calculated using the formula: DIS-I = [(HA - LA)/N] × 2, where HA and LA are the numbers of high and low achievers who answered the item correctly and N is the total number of students in the two groups. A DIS-I <0.15 indicates a poor discriminator, 0.15 to <0.25 a good discriminator, and >0.25 an excellent discriminator [9-12, 20-23]. RPB is another way of measuring item discrimination, defined as the correlation between the item score and the total test score. It is mathematically equivalent to Pearson's correlation. DIS-I and RPB are highly correlated, and a DIS-I or RPB < 0.2 is regarded as low [10][11][12][13][14][15][16].
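Both discrimination measures can be sketched in a few lines of Python. The function names and data are hypothetical. RPB is computed here as a plain Pearson correlation between the 0/1 item score and the total score, and the total score is assumed to include the item itself (no correction for item overlap).

```python
from math import sqrt


def discrimination_index(ha, la, n):
    """DIS-I = [(HA - LA) / N] x 2, with N the size of both groups combined."""
    return 2 * (ha - la) / n


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical groups of 20 high and 20 low achievers.
print(discrimination_index(ha=18, la=8, n=40))

# Hypothetical item and total scores for nine examinees.
item = [1, 1, 1, 1, 0, 1, 0, 1, 0]
totals = [9, 8, 8, 7, 6, 5, 4, 3, 2]
print(round(point_biserial(item, totals), 3))
```

Because N counts both groups, the index reaches +1 only when every high achiever and no low achiever answers correctly.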

3. Distractor analysis aims to determine the capability of the item options to distract the examinee from selecting the right answer. Each distractor must be assessed for its frequency of selection by the examinees; this is called DE [18-23]. DE can be calculated using the formula: DE = (frequency of distractor selection ÷ total number of item respondents) × 100. DE needs to be assessed in each MCQ to test for the presence or absence of NFDs.
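The DE formula and the 5% cut-off for NFDs can be combined in a short Python sketch. The function names, option labels, and response counts are hypothetical; the sketch assumes that every respondent selected exactly one option.

```python
from collections import Counter


def distractor_efficiency(responses, options, key):
    """Percentage of respondents choosing each distractor (options other than the key)."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: 100 * counts[opt] / total for opt in options if opt != key}


def non_functional_distractors(efficiency, threshold=5.0):
    """Distractors chosen by fewer than `threshold` percent of respondents."""
    return [opt for opt, pct in efficiency.items() if pct < threshold]


# Hypothetical item answered by 100 examinees; "A" is the correct option.
answers = ["A"] * 50 + ["B"] * 30 + ["C"] * 17 + ["D"] * 3
efficiency = distractor_efficiency(answers, options="ABCD", key="A")
print(efficiency)
print(non_functional_distractors(efficiency))
```

Passing the full option list ensures that a distractor nobody selected still appears with 0% and is flagged as non-functional.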