Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple Choice Test According to the Item Response Theory (IRT)

The aim of the current study is to improve evaluation practices in the educational process. A multiple choice test was developed based on content analysis, and its test specification table covered part of the vocabulary of the applied statistics course. In its final form, the test consisted of 18 items that were reviewed by specialists in the field of statistics to determine their validity. The results determine the relationship between individual responses and student ability. Most thresholds span the negative section of the ability scale. Item information curves show that the items provide a good amount of information about students with lower or moderate ability compared to students with high ability. In terms of precision, most items were more suitable for lower ability students. The test characteristic curve was plotted according to the change in the characteristics of the examinees. The information obtained for female students was greater than the information obtained for male students, and the test provided more information about students who had not studied statistics at an earlier stage than about students who had. The results clearly indicate that, based on the level of the statistics course, tests should be reviewed periodically in line with the nature and level of the course materials in order to reach a sound judgment about students' progress relative to their ability.

Keywords: item response theory; item characteristics; multiple choice; psychometric properties


I. INTRODUCTION
A test is an educational tool that is frequently used to evaluate students' academic achievement and progress. Tests also provide an opportunity to verify students' skills in many educational situations when it is not possible to use other assessment methods. Despite the known problems of indirect measurement, a lot of traits such as mathematical abilities, verbal skills, resistance to stress, intelligence, dissatisfaction, different opinions about a particular topic, etc. cannot be directly observed and measured [1][2][3]. These are known as the latent traits, and they can be measured only indirectly, often using specially prepared questionnaires where the responses are closely related to the specific traits being studied. Tests are frequently used to assess students' cognitive progress and to build question banks. As a result, the so-called latent trait models have been developed and are used to estimate the parameter values associated with the human personality [4][5]. These models provide a different type of information that in turn helps to develop and improve tests accordingly. Many researchers rely on the information from data that are analyzed as a result of the subjects' responses. But it is important to ask whether the formulation of stimuli (questions) may provide another type of data or information that serves the research process [1,[6][7][8].
Many researchers and specialists in the field of measurement and evaluation are interested in the basic concepts and organizational theoretical frameworks of measurement and evaluation and the ways to apply them [9][10][11], due to the great role this science plays in various fields of scientific research in general and in educational and psychological research in particular. Research activity aligned with the needs of educational institutions will positively affect development and improvement in accordance with Saudi Arabia's Vision 2030. Furthermore, the means of developing tests and measurement methods are extremely important because the data issued from the measurement processes have to be valid and accurate, as some crucial decisions, such as admission or promotion, may be based upon them [12][13][14][15][16]. In addition, it is the responsibility of specialists in the field of measurement and evaluation to enrich the literature, develop tests used in the educational field, and reduce the possibility of potential measurement errors during the evaluation process [17,18]. Postgraduate tests in Saudi universities have not yet been subjected to much scrutiny by local and international evaluation institutions, because most quality assurance agencies, such as the National Commission for Academic Accreditation and Assessment (NCAAA), have only recently included postgraduate programs in their plans, providing the opportunity for universities to apply for accreditation for these programs. Midterm and final exams and the way they are administered are some of the indicators used by the NCAAA or other agencies to accurately judge the progress of an academic program. Therefore, improving these tests has become a necessary requirement.

II. SIGNIFICANCE OF THE CURRENT STUDY
The significance of the current study stems from the importance of the evaluation processes in the educational process. The process of improving tests and identifying their psychometric characteristics is the task of those working in the field of measurement and evaluation in order to provide a comprehensive understanding and a deep descriptive analysis of the advantages and disadvantages observed in those tests [19][20][21]. Furthermore, the scarcity of this type of scientific studies has widened the gap between the tools currently used to measure the level of achievement of students and what is hoped that these tools should be like. The quality of these tools has not been determined or reviewed, and therefore they have not been assessed or evaluated [22][23][24][25]. The practical significance of this type of research lies in the use of Item Response Theory (IRT) models in analyzing students' responses to the achievement test in a more objective way, showing whether there is an effect of the multiplicity of characteristics of the participants on the test items (multiple choice) in terms of the accuracy of the estimates of the items' parameters and the individuals' ability parameters. It can also guide the composers of the test questions to take into account some points that may affect the psychometric properties of the items and the test and the accuracy of its results [26][27][28][29].
III. ITEM RESPONSE THEORY
There are many ways (i.e. models) to determine the relationship between individual responses and student ability. Within the framework of modern measurement theory, many models and applications have been formulated and applied to real test data. This includes the measurement assumptions about the characteristics of the test item, the performance of the subject, and how this performance is related to knowledge [27,37,40]. Tests and evaluation processes in general form the basis of the education system, and their importance lies in improving educational planning, developing a mechanism for enhancing curricular content, measuring learners' competence, and comparing student performance or achievement data. Evaluations also have a role for schools and teachers [30][31][32][33]. Tests are a tool of assessment, and their quality depends to a large extent on the nature and quality of the information collected during the preparation of the assessment. Over the decades, the test building system has undergone considerable development through the emergence of many test building theories focused on many types of tests, such as oral tests, standardized tests, and realistic evaluation. Theories have continued to develop to this day in order to keep up with changes in policies and new educational practices [2]. The modern theoretical methods were largely developed from the sixties to the late eighties. The IRT is a general statistical theory considering the characteristics of the test item, the subject's performance on the item, and how the performance is related to the abilities that are measured by the test items [12,[34][35][36]. The IRT provides a rich statistical tool for analyzing educational tests and psychometric measures. The IRT assumes the following:
• The test performance of the subjects can be predicted (or explained) by a set of factors called traits or latent traits and abilities.
• The relationship between the subject's performance and the properties of the test item can be described through a monotonic increasing function called the item characteristic function (represented graphically by the item characteristic curve).
• The response on a test item can be either discrete or continuous, and it can be dichotomous (binary) or polytomous. Item score categories can be ordered or unordered, and there can be one or many abilities behind the test performance.

IV. CHARACTERISTICS OF THE IRT MODELS
• The IRT model should define the relationship between the observed response and the unobserved construct (latent trait).
• The model should provide a method for estimating the degrees of the latent trait.
• The subjects' scores will be the basis for assessing the underlying construct of the model.
• The IRT model assumes that the subject's performance can be predicted or explained by one latent trait or more.
In IRT it is often assumed that the examinee has some unobservable latent trait (also called latent ability), which cannot be studied directly. The purpose of IRT is to propose models that link this underlying trait to characteristics that can be observed on the subject [41]. There are many models in IRT, and they have been classified into two types: models that use the cumulative normal ogive curve and logistic models. Logistic models are currently more widespread; they are suitable for dichotomous items and differ according to the number of estimated item parameters [6,31,[42][43][44][45]. There are three commonly used models for binary data, which use 1 for a correct response and 0 for a wrong response: the one-, two-, and three-parameter logistic models, which will be examined below.

A. One-Parameter Logistic Model
The concept of information plays an important role in IRT, as it can be used to evaluate how accurately an item included in the test measures the level of the latent trait with parameter value θi. This latent trait could be, for example, the level of the student's knowledge, intelligence, ability, satisfaction, stress, etc. In educational tests, the item parameter represents the difficulty of the item, while the subject parameter represents the ability level of the people being evaluated. The greater the subject's ability in relation to the difficulty of the item (the parameter βj describes the degree of difficulty of the item and the level of influence of the item on the subject), the greater the probability of a correct response to that item. When the subject's position on the latent trait is equal to the difficulty of the item, according to Rasch's model, there is a 0.5 probability that the subject's response is correct. Accurate information about the value of θi depends on a number of factors, the most important of which is the properties of the questions (items) used to estimate the parameter (the latent trait) [2,30,33,46].
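The 0.5-probability property just mentioned is easy to verify numerically. Below is a minimal sketch of the Rasch (one-parameter) response function; the function and variable names are illustrative, not taken from the study:

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model:
    P(theta) = 1 / (1 + exp(-(theta - b))),
    where theta is the examinee ability and b the item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals item difficulty, the probability is exactly 0.5.
print(rasch_probability(0.0, 0.0))   # 0.5
# A more able examinee has a higher chance of success on the same item.
print(rasch_probability(1.0, 0.0))   # about 0.731
```

Note that only the difference θ − b matters, so shifting ability and difficulty together leaves the probability unchanged.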

www.etasr.com
Almaleki: Examinee Characteristics and their Impact on the Psychometric Properties of a Multiple …

B. Two-Parameter Logistic Model
In the Two-Parameter Logistic (2PL) model, the situation is different from the one-parameter model. The one-parameter model assumes that questions differ only with respect to item difficulty, whereas in the two-parameter logistic model two parameters are assumed to be connected to the test item: the parameter βj, which describes the difficulty of the item (question), and the additional parameter αj, which describes the discrimination of the item. The discrimination parameter αj (the slope of the curve) describes the degree to which the question helps to distinguish between subjects with the highest level of a trait and those with a lower level of the same trait. This parameter also shows the extent of the relevance of the item to the overall score of the test. The higher the value of this parameter, the greater the discrimination of the item (and the easier it is to separate subjects with a high level of the trait from those with a low level of the same trait). It should also be noted that the most difficult test item is not necessarily the test item with the highest potential to discriminate between the subjects [2,19,36,47,48].

C. Three-Parameter Logistic Model
The Three-Parameter Logistic (3PL) model is used in IRT, and it determines the probability of a correct response to a dichotomously scored multiple-choice item as a logistic distribution. The 3PL model is an extension of the 2PL model, as it introduces the guessing parameter. Items now differ in terms of discrimination, difficulty, and the probability of guessing the correct response [47]. The guessing parameter, denoted cj, is the lower asymptote of the item characteristic curve and represents the probability that subjects with very low ability answer the item correctly. The parameter is included in the model to account for item response data from low-ability subjects, for whom guessing is a factor in test performance [48][49][50]. The basic equation of the 3PL model gives the probability that a randomly selected examinee with a certain proficiency level will respond correctly to item j, which is characterized by discrimination (αj), difficulty (βj), and guessing probability (cj) [27,35,37,38,51].
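The verbal description above can be stated as an equation. A standard form of the 3PL model, using the text's symbols (with the guessing parameter written as c_j to match the item index), is:

```latex
P_j(\theta) = c_j + (1 - c_j)\,\frac{e^{\alpha_j(\theta - \beta_j)}}{1 + e^{\alpha_j(\theta - \beta_j)}}
```

Setting c_j = 0 recovers the 2PL model, and additionally fixing α_j at a common value for all items recovers the one-parameter (Rasch) model.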
V. MULTIPLE CHOICE TEST ANALYSIS
Understanding how to interpret and use the information based on students' test scores is just as important as knowing how to create a well-designed test. An essential part of building tests is using the feedback from a good test analysis. Among the most important statistical information provided by a good analysis of a multiple-choice test are the following:

A. Item Difficulty
The test item difficulty factor βj represents the percentage of the respondents who answered the item correctly. The difficulty factor ranges from 0.00 to 1.00. The higher the value of the difficulty factor, the easier the test item. For example, when the value of βj is higher than 0.90, the test item is described as very easy and should not be used again in subsequent tests, since almost all students are able to respond to it correctly. When the value of βj is less than 0.20, the test item is described as extremely difficult and should be reviewed before use in subsequent tests. The optimal test item difficulty factor is 0.50, as it ensures maximum discrimination between high and low ability students [52][53][54]. To maximize item discrimination, the desired difficulty level is slightly higher than halfway between the probability of answering correctly by chance (1.00 divided by the number of alternatives for the item) and the ideal score for the item (1.00) [55][56][57][58]. For example, if the test item contains four answer alternatives, the probability of answering it correctly by chance is 0.25 (1.00/4 = 0.25), and the ideal degree of difficulty for the item can be calculated using the following rule:
Ideal difficulty = ((ideal score for the item − probability of a correct answer by chance) / 2) + probability of a correct answer by chance
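The rule above can be checked numerically with a minimal sketch (the function name is illustrative):

```python
def ideal_difficulty(n_alternatives, ideal_score=1.0):
    """Ideal item difficulty: halfway between the chance level
    (1 / number of alternatives) and the ideal score of 1.00."""
    chance = 1.0 / n_alternatives
    return (ideal_score - chance) / 2 + chance

# Four alternatives: chance = 0.25, so the target difficulty is 0.625.
print(ideal_difficulty(4))   # 0.625
# Two alternatives (true/false): chance = 0.5, target is 0.75.
print(ideal_difficulty(2))   # 0.75
```

As expected, the fewer the alternatives, the higher the target difficulty index, because guessing alone inflates the proportion of correct answers.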

B. Item Discrimination
The test item discrimination factor, referred to using the symbol αj, represents the point-biserial relationship between the respondents' performance on the item and their total scores. The discrimination factor value ranges from -1.00 to 1.00. A high value of the discrimination factor indicates that the test item is able to distinguish between respondents: those who scored high on the test as a whole were able to answer the item correctly, while those who obtained low total scores were not [54,59]. Test items whose point-biserial values are close to or less than zero should be removed. Moreover, further consideration should be given to any item that was answered better by those who generally performed poorly on the test than by those who performed better on the test as a whole; such an item may be confusing in some way to top-performing respondents [52,53,58,59].

VI. PSYCHOMETRIC PROPERTIES OF THE TEST
The psychometric properties of the test are examined to verify that it is objectively constructed, free of any bias, and suitable for the target group regardless of gender, race, and religion.
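The point-biserial discrimination index described above can be computed directly from item and total scores. A minimal sketch with hypothetical data (not the study's responses; names are illustrative):

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1-scored item and total scores."""
    n = len(item_scores)
    # Mean total score of those who got the item right vs. wrong.
    m1 = mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    m0 = mean(t for i, t in zip(item_scores, total_scores) if i == 0)
    p = sum(item_scores) / n          # proportion answering correctly
    return (m1 - m0) * ((p * (1 - p)) ** 0.5) / pstdev(total_scores)

# Hypothetical responses to one item (1 = correct) and total test scores.
item   = [1, 1, 1, 0, 0]
totals = [17, 15, 12, 8, 6]
print(round(point_biserial(item, totals), 2))  # 0.91: discriminates well
```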
Studying the rules for formulating multiple-choice test items is important because it has an impact on the level of performance on the items or the test as a whole. This means that the good construction of the test and the verification of all its psychometric properties ensures that the test avoids any violations in the structure of the items, which in turn affects the individual's performance on the test items [36,[60][61][62].

VII. METHODS
The descriptive survey method was used to obtain data from a real-life scenario of giving postgraduate-level midterm and final exams to assess master's students' achievement level in the subject of applied statistics. This analytical study aimed to determine the quality of the test, its efficiency, and the reliability of its results despite the varying circumstances in which it is given.

A. Measurements
A Criterion Referenced Test (CRT) was used to evaluate students' achievement in the applied statistics course at the master's degree level, in order to verify the quality of the test as a tool for evaluating students' achievement. The test was developed based on content analysis, and the test specification table covered part of the vocabulary of the applied statistics course. In its first form, the test consisted of 25 items that were reviewed by specialists in the field of statistics to determine their face validity, and 7 items were omitted as a result. So, in its final form, the test had 18 items, which were applied to the study sample to verify their quality (Table III). The results of the thorough analysis of the test items were handed over to the central question bank in order to compare the performance of the test items and track their behavior when statistical operations are repeated on them later. To verify the test reliability, the Kuder-Richardson (KR-20) method was used, because the binary data are coded as 0 and 1 after scoring the items and because the test items differ in their difficulty parameter. The results indicated that the test has a high reliability coefficient of KR-20 = 0.842.
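The KR-20 coefficient reported above can be computed directly from the 0/1 score matrix. A minimal sketch with a small hypothetical matrix (the study's actual response data are not reproduced here):

```python
from statistics import pvariance

def kr20(score_matrix):
    """Kuder-Richardson 20 reliability for 0/1-scored items.
    score_matrix: one row per examinee, one column per item."""
    k = len(score_matrix[0])                       # number of items
    totals = [sum(row) for row in score_matrix]    # examinee total scores
    # Sum of p*q over items, where p is the proportion answering correctly.
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / len(score_matrix)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / pvariance(totals))

# Four hypothetical examinees, three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(data))  # 0.75
```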

B. Sample
The current study population consisted of all students of the applied statistics course at the master's level at Umm Al-Qura University on the main campus in Makkah and all branches of the University. The size of the population was rather large, estimated at about 400 male and female students registered during the second semester of the academic year 2020. It was difficult to reach all population members because of the financial cost, time, and effort required. Moreover, the educational and population environment conditions for all students are very similar, and previous studies using samples from the population of students in Saudi universities did not show any clear bias. Therefore, the current study used a random sample consisting of 338 students, equivalent to 84.5% of the study population, studying different disciplines in the College of Education. The members of the sample were contacted by e-mail, and a test link was created and made available on each student's electronic page. The link was made available for one hour, representing the testing period, and prior coordination with the study sample was made to select a time appropriate for everyone. Cases which had any type of technological problem were not recorded. Electronic reminders via the university's electronic system were used before the test to alert the study sample about the test time. The distribution of the sample by whether the students had studied statistics at an earlier stage was as follows:

  Yes    56   51   52   42   34   | 235
  No     18   18   14   20   33   | 103
  Total  74   69   66   62   67   | 338

VIII. RESULTS
Figure 1 shows the eigenvalue scree plot. It is clear that the first eigenvalue is much greater than the others, suggesting that a unidimensional model is reasonable for these data. Since each test item has one correct alternative worth a single point, the item difficulty is simply the percentage of students who answered the item correctly, which is equal to the item mean. The item difficulty indices in Tables VI to X show the ranges based on the examinees' characteristics.
The item difficulty ranged from 0.647 to 0.928. For students who had studied statistics at an earlier stage, the item difficulties ranged from 0.878 to 0.970, whereas for students who had not, they ranged from 0.310 to 0.893. Based on gender, the item difficulties ranged from 0.650 to 0.966 for male students and from 0.645 to 0.924 for female students.
Regarding the students' current major, the item difficulty varied across subjects: for special education it ranged from 0.522 to 0.985, for educational administration from 0.50 to 0.870, for curriculum and instruction from 0.651 to 0.999, for Islamic education from 0.521 to 0.927, and for psychology from 0.837 to 0.999. For students with a higher GPA, the item difficulties ranged from 0.857 to 0.994, whereas for a moderate GPA they ranged from 0.400 to 0.936, and for a low GPA from 0.021 to 0.869. Figures 2-19 present the combined curves for the 18 items based on the overall data. Each item has one threshold. Most thresholds span the negative section of the ability scale. Item information curves show that the items provide a good amount of information for students with lower or moderate ability compared to high ability students. In terms of precision, most items were more suitable for lower ability students (e.g. items 9, 10, 17), while items 2, 5, and 14 gathered more information for students of moderate ability. Figure 20 presents the test characteristic curve, which is the functional relation between the true score and the ability scale. The expected true score is near 1 at the lowest levels of ability and increases with ability, approaching the maximum of about 18 for high ability students. Figure 21 presents the total amount of information obtained from the test. It appears clearly that the test gives good indicators for assessing lower levels of ability. The amount of information obtained for female students appeared to be greater than the information obtained for male students, and the test provided more information about students who had not studied statistics at an earlier stage compared with students who had.
Figures 27 and 28 also confirm that the test provides a good amount of information for students who had a lower or moderate ability compared to students who had high ability.

IX. DISCUSSION
The study findings support previous research [5][6][7][63] on the use of IRT models in analyzing students' responses to an achievement test in a more objective way, showing whether the multiplicity of the participants' characteristics has an effect on the (multiple choice) test items in terms of the accuracy of the estimates of the item parameters and the individuals' ability parameters. The results also help determine the relationship between individual responses and the underlying ability. Within the framework of modern measurement theory, many models and applications have been formulated and applied to real test data. This includes the measurement assumptions about the characteristics of the test items, the performance of the subjects, and how performance is related to knowledge. The results clearly indicate that, based on the level of the statistics course, there should be a periodic review of the tests in line with the nature and level of the course materials in order to reach a sound judgment about students' progress and the level of their ability. This conclusion is consistent with the findings of [50,64].

X. CONCLUSION
In general, this study seeks to further improve the evaluation practices in the educational process. The tests that describe the students' progress in the educational process should be subject to review by evaluation and measurement specialists in order to ensure that we have valid and reliable evaluation tools. The administrators of the educational system need to find a mechanism to review question banks and align them with the requirements of the scientific material of the courses that are subject to continuous development.