Evaluating the fairness of a high-stakes college entrance exam in Kuwait

The use of college entrance exams to facilitate admission decisions has become controversial, and the central argument concerns the fairness of test scores. The Kuwait University English Aptitude Test (KUEAT) is a high-stakes test, yet very few studies have examined the psychometric quality of the scores for this national-level assessment. This study illustrates how measurement approaches can be used to examine fairness issues in educational testing. Through a modern view of fairness, we assess the internal and external bias of KUEAT scores using differential item functioning analysis and differential prediction analysis, respectively, and provide a comprehensive fairness argument for KUEAT scores. The analysis of internal evidence of bias was based on 1790 examinees' KUEAT scores from November 2018. KUEAT scores and first-year college GPAs of 4033 students enrolled in KU were used to assess the external evidence of bias. Results revealed many items showing differential item functioning across student subpopulation groups (i.e., nationality, gender, high school major, and high school type). Meanwhile, KUEAT scores also predicted college performance differentially across student subgroups (i.e., nationality, high school major, and high school type). Discussion and implications for the fairness of college entrance tests in Kuwait are provided.


Introduction
The use of college entrance exams, such as the Scholastic Assessment Test in the United States, for making admission decisions has been controversial for years. The central argument is around the fairness of test scores in determining university admission. Rhoades and Madaus (2003) indicated that college entrance exams contain biased items against minority groups. Zwick (2007) also revealed that students from historically under-resourced, marginalized, and underrepresented populations cannot afford test preparation coaching, which results in lower scores on standardized tests. However, compared with other measures, these exams provide "a neutral yardstick" for comparing the performance of students from different high schools that vary greatly in contextual factors such as course offerings (Buckley et al., 2018), and they objectively measure the academic achievement of students (Churchill et al., 2015).
In Kuwait, the Ministry of Education and the Ministry of Higher Education yearly announce a minimum required high school GPA for Kuwaiti students who want to apply for a fully funded governmental scholarship that covers tuition expenses with a monthly stipend. The funded students may study abroad or in the country for their college education. Many students prefer to study in the country, making admission to the only public university, Kuwait University (KU), competitive, especially in medical and engineering colleges. The admission decision at KU is based on two criteria: the high school grade point average (GPA) and aptitude test scores. When KU was founded, admission decisions were based only on high school GPA. In 1997, KU collaborated with the Ministry of Education in Kuwait and added the requirement of taking and submitting aptitude test scores. The Kuwait University English Aptitude Test (KUEAT) is one of the college entrance tests at KU. This test was developed by the faculty of participating colleges at KU in coordination with the Ministry of Education in Kuwait. The high school GPA and the test scores are combined to compute a weighted average score.
The aptitude test scores are very important for making admission decisions, especially given the inflation of high school GPAs. The percentage of 12th grade students in Kuwait who scored 90% or above on their high school GPA increased from 25.43% in 2016 to 47.16% in 2022 (Hasan, 2022), and remained at 41.09% in 2023 and 41.80% in 2024 (Hasan, 2024). Illustrating the consequences of such inflation, Gershenson (2018) reported a low discrimination power of high school GPAs for distinguishing students at different proficiency levels; among students with high GPAs, only a small proportion received high scores on statewide end-of-course exams. Additionally, the public prosecution detected 40,000 students cheating in the Fall 2022 final exams because the students accessed the test material before the test time with the help of Ministry of Education employees (Habib and Al Hamadi, 2023).
The inflation of high school GPAs and the cheating crisis in Kuwait undermined the validity and credibility of high school GPAs, which in turn highlighted the need for a standardized test for college admission. Standardized tests can be administered to a large student body within a short time frame and under relatively similar testing conditions. A well-developed test can provide valid, reliable, fair, and comparable scores that support college admission decisions. This requires a comprehensive psychometric analysis of test scores, including the detection of misfitting and biased items.
There is a lack of research examining and reporting the psychometric quality of KUEAT scores, especially with modern measurement approaches (e.g., Rasch measurement theory). Additionally, KU has never published technical reports on the KUEAT regarding the development process, administration process, scoring procedure, and psychometric properties. In the existing literature, two studies (Eid, 2009; Shamsaldeen, 2019) examined the predictive validity of KUEAT scores with college performance as the criterion. Eid (2009) investigated the predictive validity of KUEAT scores for students who graduated from high school in 1999, 2000, and 2001, and indicated a non-significant relationship between KUEAT scores and college performance. Shamsaldeen (2019) found that KUEAT scores were a significant predictor of second-year college GPAs for science, technology, engineering, and mathematics students. However, KUEAT scores explained only a small amount of variation in college GPAs. Both the secondary and higher education systems in Kuwait have been evolving over the years, which may explain the contradictory results. This also highlights the necessity of developing, evaluating, and maintaining KUEAT scores regularly based on psychometric theory.

Purpose of the study
Researchers worldwide have been attending to the bias of high-stakes test scores and investigating the potential factors (e.g., Huang et al., 2016; Sabatini et al., 2015). However, the KUEAT lacks evidence to support the fair use of its scores in making admission decisions.
This study aims to fill these gaps by adopting a comprehensive view of fairness and examining the internal and external bias of KUEAT scores, assessing psychometric properties with a special focus on fairness. Reliability and validity evidence are also collected as two foundational areas related to fairness. Specifically, differential item functioning (DIF) analysis based on Rasch measurement theory is used to assess internal bias across student subpopulation groups. A moderation analysis based on a regression model is used to detect differential prediction and examine the external bias of test scores (AERA, APA and NCME, 2014). The demographic groups considered in this study include nationality, gender, high school type, and high school major. In particular, we address the following research questions:
1. How well does the KUEAT scale reflect student English proficiency (i.e., do KUEAT items fit the underlying scale; construct validity)?
2. How well can students at different English proficiency levels be separated by the Rasch scale (i.e., reliability of separation for students)?
3. Are KUEAT items equally difficult across different student subgroups (i.e., does any item show DIF; internal evidence of bias)?
4. How well do KUEAT scores predict first-year college GPA for different student subgroups (i.e., do KUEAT scores show differential prediction; external evidence of bias)?

Fairness as a foundational area of psychometrics
In accordance with the Standards for Educational and Psychological Testing (abbreviated as Test Standards hereafter; AERA, APA and NCME, 2014), fairness is defined as ensuring that all test takers have similar opportunities to demonstrate their abilities on the construct a test intends to measure. Tests yielding fair scores should exhibit an absence of bias, impartial treatment of all examinees throughout the testing procedure, and equal access to learning the material. Fairness spans from the conceptualization of an assessment to the utilization of assessment outcomes across all subpopulation groups throughout every assessment phase. While some researchers focus on the consequences of test utilization, others prioritize the validity of inferences drawn from test outcomes (Camilli et al., 2013). Fairness is a crucial foundational component for developing, evaluating, and using educational assessments.
It is crucial to understand the difference between fairness and bias. Fairness is a societal concept, while bias is a psychometric measure. The Test Standards emphasize a comprehensive understanding of fairness that scrutinizes numerous facets related to the purpose of testing (AERA, APA and NCME, 2014). This entails considering the technical properties of tests, how test scores are reported, the consequences of score uses, and the construct-relevant and construct-irrelevant variables attributed to the performance of individuals and subpopulation groups (e.g., Wang et al., 2020). Bias refers to an anomaly in measurement, and fairness can be viewed as the outcome of who is advantaged or disadvantaged by this anomaly (Camilli, 2006). The bias of an item or a test against a particular subgroup undermines score fairness. We can quantify the degree of bias using statistical approaches, such as DIF and differential prediction analysis.
A test is biased with respect to a subpopulation group if consistent non-zero prediction errors occur for members of that subgroup when predicting a criterion the test was designed for (Cleary, 1968). This could be attributed to test items requiring sources of knowledge that differ from those intended to be measured, thereby introducing systematic bias and reducing the validity of test scores for a specific group (Camilli & Shepard, 1994). Statistical results may reveal potential bias of test scores against a subpopulation group. Meanwhile, substantive explanations are needed to justify the reasons behind the numbers.
Two sources of bias should be investigated using statistical analysis: internal and external evidence of bias. The internal evidence of bias can be quantified by the difference in the probabilities of correctly answering an item between two groups possessing equivalent ability, attributable to construct-irrelevant factors being measured (Camilli, 2006). DIF analysis is often used to examine item-level bias and collect internal evidence of bias. It examines whether or not an item functions the same way for examinees at the same ability level regardless of their group membership.
The external evidence of bias is assessed by observing differential prediction of test scores on criterion variables across subgroups (Camilli, 2006). The Test Standards (AERA, APA and NCME, 2014) state that "... test developers and/or users are responsible for evaluating the possibility of differential prediction for relevant subgroups for which there is prior evidence or theory suggesting differential prediction" (p. 15), and suggest that "…differential prediction is often examined using regression analysis" (p. 66). Specifically, a regression model with demographic variables as moderators is used for differential prediction analysis. We attend to the relationship between test scores and college performance and how this relationship differs between subpopulation groups. In summary, the internal evidence relates to bias at the item level, and the external evidence concerns bias at the overall test level.
Validity and reliability are two other foundational areas of psychometrics for educational and psychological testing. In some sense, test scores must be valid and reliable before we attend to fairness issues. As indicated by Fig. 1, validity, reliability, and fairness form a stable system for assessing the psychometric quality of a testing instrument. Each foundational component contains internal and external evidence. For instance, internal consistency indicates how reliable the scores are, and generalizability reflects how well the scores maintain reliability across different conditions. The Test Standards (AERA, APA and NCME, 2014) encourage using an integrated validity argument consisting of five forms of validity evidence. Internal evidence based on test content, internal structure, and response processes addresses how well the scores reflect the measured construct. Relations to other variables reflect external validity, i.e., how comparable the current measures are with criterion measures such as tests that measure the same construct or future performance. The consequences of testing refer to the positive or negative social consequences of a particular test, which contribute to the external evidence of validity. Fairness can be viewed as maintaining valid and reliable scores across subpopulation groups. Internal bias occurs when the internal structure varies as a function of group membership, and external bias shows when relations to other variables differ across subpopulation groups.

Data description
A secondary dataset was used for the analyses. The data were obtained from the Center of Evaluation and Measurement at KU and did not contain identifiable information about individual test-takers. Three parallel forms with the same items in different sequences were administered in November 2018. For test security reasons, the complete test booklets and item content were not provided by KU. Therefore, the examination of internal bias was based on one particular test form using the item-level responses from test-takers who took that form of the KUEAT (Sample 1). The external evidence of bias was collected based on students who were accepted by KU and started college in 2019/2020 (Sample 2).
The datasets used and analyzed in this study are available from the Center of Evaluation and Measurement at KU upon request.

Rasch-based differential item functioning analysis
Rasch (1960/1980) introduced a dichotomous Rasch model for obtaining objective measures of a unidimensional construct. The probability of answering an item correctly is a function of a person's latent ability and an item's difficulty. The Rasch model for dichotomous responses is shown below.

P_ij = exp(θ_j − b_i) / (1 + exp(θ_j − b_i))    (1)

where P_ij is the probability of person j correctly answering item i, θ_j represents the English proficiency of person j, and b_i is the difficulty of item i.
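As a minimal illustration, the Rasch response probability can be sketched in a few lines of Python. This is our own sketch, not the operational Winsteps implementation; the function name and example values are assumptions for demonstration.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model:
    exp(theta - b) / (1 + exp(theta - b)), written in logistic form."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When a person's ability equals the item's difficulty, the probability is 0.5
print(round(rasch_prob(0.0, 0.0), 2))  # 0.5
# A more able person has a higher success probability on the same item
print(rasch_prob(1.0, 0.0) > rasch_prob(-1.0, 0.0))  # True
```

Only the difference θ_j − b_i matters, which is what places persons and items on the same logit scale.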
Before examining the fairness of scores, we need to evaluate how valid and reliable the scores are (Engelhard & Wang, 2021). Rasch models provide residual-based Outfit and Infit mean square statistics for examining the internal structure of the scale. Outfit is an outlier-sensitive statistic that detects outlying responses. Infit is an information-weighted statistic that identifies unexpected response patterns. For high-stakes multiple-choice items, Linacre (1994) suggests a range of acceptable fit to the scale between 0.8 and 1.2. Values lower than 0.8 indicate overfit to the Rasch scale with a lack of variation in response patterns, and values greater than 1.2 reflect underfit with more variation in responses than expected (Linacre, 1994). The fit indices are used to examine fit at the individual item level. Meanwhile, to assess whether the underlying scale measures a single construct, that is, whether it is unidimensional, the proportion of variance explained by the measures is used. An appropriate cut-off value for the variance explained by the model is debatable. In practice, unidimensional Rasch models are recommended when the proportion is greater than 20% (Hambleton & Rovinelli, 1986).
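These cut-offs are straightforward to operationalize. The sketch below, with hypothetical item labels and mean square values of our own choosing, flags any item whose fit statistic falls outside the 0.8–1.2 range:

```python
def flag_misfit(fit_stats, lo=0.8, hi=1.2):
    """Return the IDs of items whose mean square fit statistic
    (Outfit or Infit) falls outside the acceptable range."""
    return sorted(item for item, mnsq in fit_stats.items()
                  if not lo <= mnsq <= hi)

# Hypothetical Outfit values: q2 underfits (> 1.2) and q3 overfits (< 0.8)
outfit = {"q1": 0.95, "q2": 1.35, "q3": 0.70, "q4": 1.10}
print(flag_misfit(outfit))  # ['q2', 'q3']
```

The same rule applies to Infit values; an item is examined further whenever either statistic leaves the acceptable band.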
Rasch models provide a reliability of separation index, which can be interpreted as the internal consistency of person scores. This index is comparable to Cronbach's alpha in that it ranges from 0 to 1, with a higher value indicating higher consistency. The reliability can be transformed into a separation index, calculated as the spread of the elements expressed in standard-error units (Wright & Masters, 1982). The index is also unique in that latent scores, rather than observed or raw scores, are used to compute it. The use of latent scores makes this index more useful for examining the reliability of the underlying latent scale. Low reliability values (i.e., close to 0) indicate a narrow range of location estimates, and high reliability values (i.e., close to 1) indicate a wide range of estimates (Bond et al., 2021).
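The reliability-to-separation conversion can be sketched as follows; the formula G = sqrt(R / (1 − R)) follows Wright and Masters (1982), and the example value of 0.94 is the person reliability reported in the Results of this study:

```python
import math

def separation_index(reliability):
    """Convert a Rasch reliability of separation R into the separation
    index G = sqrt(R / (1 - R)), i.e., the spread of location estimates
    expressed in standard-error units."""
    return math.sqrt(reliability / (1.0 - reliability))

# A reliability of 0.94 corresponds to a separation of nearly 4
# standard-error units of spread among the person estimates
print(round(separation_index(0.94), 2))  # 3.96
```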
For assessing internal bias at the item level, DIF analysis is conducted using the Rasch-Welch t-test, which examines whether an item is equally difficult between subgroups defined by examinees' demographic characteristics such as gender, nationality, high school type, and high school major. The Welch t statistic is shown below.

t = (b_1 − b_2) / √(SE_1² + SE_2²)

where b_1 and SE_1 are the item difficulty measure and its standard error for Group 1, and similarly, b_2 and SE_2 are for Group 2. Bond and Fox (2015) suggested that an item exhibits DIF if (a) the significance test for the t statistic produces a p value below 0.05 and (b) the DIF contrast (b_1 − b_2), as an effect size, is greater than or equal to 0.5 logits.
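This two-part flagging rule can be sketched in Python. The function names are ours, and we use a normal approximation (critical value 1.96) in place of the exact t reference distribution; this is an illustrative simplification, not the Winsteps implementation.

```python
import math

def dif_contrast(b1, se1, b2, se2):
    """DIF contrast (b1 - b2) and its Welch t statistic for one item
    calibrated separately in two groups."""
    contrast = b1 - b2
    t = contrast / math.sqrt(se1 ** 2 + se2 ** 2)
    return contrast, t

def flags_dif(b1, se1, b2, se2, crit_t=1.96, crit_logits=0.5):
    """Apply the Bond and Fox (2015) rule: flag DIF only when the contrast
    is both statistically significant and at least 0.5 logits in size."""
    contrast, t = dif_contrast(b1, se1, b2, se2)
    return abs(t) >= crit_t and abs(contrast) >= crit_logits

# A 0.6-logit contrast with small standard errors is flagged
print(flags_dif(0.3, 0.1, -0.3, 0.1))   # True
# The same contrast with large standard errors is not significant
print(flags_dif(0.3, 0.4, -0.3, 0.4))   # False
```

Requiring both conditions guards against flagging trivially small but statistically significant contrasts in large samples.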
Item-level responses to the KUEAT from Sample 1 are used for the DIF analysis. The data are analyzed using a specialized Rasch analysis program, the Winsteps software (version 5.3.1; Linacre, 2022), with a joint maximum likelihood estimation method.

Regression-based differential prediction analysis
The differential prediction analysis is used to detect external bias in test scores. We use a regression model with demographic variables as moderators. Each regression model contains a single moderator to examine whether differential prediction exists across subpopulation groups defined by examinees' demographic variables. This is clearer in terms of interpretation than analyzing all demographic groups together. The regression model is specified below.
Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ε    (2)

where Y denotes first-year GPA, X_1 refers to KUEAT scores, X_2 is the dummy-coded demographic variable or moderator (i.e., gender, nationality, high school type, or high school major), the β's are the regression coefficients, and ε is the residual term. Among all regression coefficients, β_3 is of primary interest for addressing the research questions. After accounting for the main effects of KUEAT scores (β_1) and the demographic variable (β_2), the interaction effect indicates whether the prediction of first-year GPA from KUEAT scores differs across demographic groups. If β_3 is significantly different from zero, reflecting a moderation effect, then differential prediction of KUEAT scores occurs between demographic groups.
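To make the role of β_3 concrete, the sketch below fits a moderated regression of this form by ordinary least squares on small, noise-free hypothetical data; all numbers are illustrative, not the study's estimates. By construction, the slope for group 0 is β_1 and the slope for group 1 is β_1 + β_3.

```python
def solve(A, b):
    """Solve A x = b for a small linear system by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def moderated_ols(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 + b3*(x1*x2) via the normal equations."""
    X = [[1.0, a, g, a * g] for a, g in zip(x1, x2)]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(4)] for i in range(4)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(4)]
    return solve(XtX, Xty)

# Hypothetical noise-free data: slope -0.002 for group 0 and
# -0.002 + 0.012 = 0.010 for group 1, mimicking a moderation effect
scores = [40, 50, 60, 70, 80] * 2
group = [0] * 5 + [1] * 5
gpa = [2.5 - 0.002 * s + (0.3 + 0.012 * s) * g for s, g in zip(scores, group)]
b0, b1, b2, b3 = moderated_ols(scores, group, gpa)
print(round(b3, 3))  # 0.012 -- the moderation (interaction) effect
```

With noise-free data the fit recovers the generating coefficients exactly; in practice one would also test whether β_3 differs significantly from zero.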
The differential prediction analysis is conducted in R Statistical Software (v4.2.1; R Core Team, 2022) based on Sample 2.

Results
A dichotomous Rasch model is fitted to calibrate student responses to KUEAT items. A variable map is shown in Fig. 2, with persons and items located along a common latent scale. Examinees with higher English proficiency are located toward the top, and less proficient students are located near the bottom. For items, more difficult items are located at the top, and easier items at the bottom. The Rasch analysis results are used to address research questions 1 to 3. The moderated regression analyses are then used to address research question 4.

RQ1: How well does the KUEAT scale reflect student English proficiency?
First, the constructed Rasch scale explains 33.4% of the variation in student responses, which supports that the KUEAT measures a single dimension of English proficiency. Next, we obtained the Outfit and Infit mean square statistics for individual items. Results indicated 43 items with Outfit values outside the acceptable range (i.e., 0.8–1.2; Linacre, 1994) and 12 items showing misfit based on Infit statistics (Fig. 3).

RQ2: How well can students at different English proficiency levels be separated by the Rasch scale?
The variable map shows a roughly normal distribution of students' English proficiency estimates. A few examinees with locations above the most difficult items (i.e., q25 and q26) answered all the items correctly, and those with locations below the easiest items (i.e., q45 and q5) got all items wrong. The Rasch analysis provides a person reliability of separation index for assessing how distinctive the person estimates are along the Rasch scale. Our analysis returned a value of 0.94, indicating that the underlying Rasch scale with calibrated KUEAT items can distinguish examinees at different English proficiency levels. The scale can be expected to show high replicability with similar items measuring the same construct. For comparison purposes, we obtained Cronbach's alpha using the observed scores. The alpha coefficient was 0.93, suggesting high internal consistency of KUEAT items.

RQ3: Are KUEAT items equally difficult across different student subgroups?
After placing all examinees on a common latent scale along with the calibrated items, DIF analyses based on t-tests were conducted for each demographic variable, including nationality, gender, high school type, and high school major, to examine whether any item shows bias against a particular subpopulation group. Statistically, the Rasch-Welch t-test was used to examine whether an item is equally difficult across subgroups. An item was flagged as displaying DIF when its item difficulty measures were significantly different (p < .05) between subgroups.
Fig. 2 Variable map for Kuwait University English Aptitude Test calibration. Note. Each "#" represents 10 examinees and each "." represents 1 examinee on the left side of the scale. Persons are ordered based on their latent English proficiency. An examinee with a higher location measure has higher proficiency. The test items are displayed on the right side, with a higher location indicating higher difficulty.

Table 2 shows the t-test results for DIF items across nationality, gender, and high school major. The DIF items by high school type are displayed separately in Fig. 4. Results indicated five items showing DIF between nationality groups (Table 2, panel A). Two items (q34 and q37) had higher item difficulty measures for Kuwaiti than for non-Kuwaiti examinees. Three items (q24, q20, and q82) were more difficult for non-Kuwaiti examinees. Next, five items showed DIF by gender, among which three items (q34, q42, and q5) were significantly more difficult for males, and two items (q41 and q39) were more difficult for females (Table 2, panel B). Third, three items exhibited DIF between examinees with different high school majors (Table 2, panel C). In particular, two items (q82 and q25) were more difficult for examinees who majored in humanities, and one item (q67) was more difficult for examinees in science majors. Lastly, 30 items showed DIF across subgroups defined by high school type. Among these, 16 items were more difficult for students from public schools, favoring examinees from private schools, while 14 items favored examinees from public schools and were more difficult for students from private schools (Fig. 4).

RQ4: Do KUEAT scores have the same prediction toward first-year college GPA across student subgroups?
The differential prediction analysis was conducted on Sample 2 using moderated regression analysis. Specifically, we examine whether KUEAT scores predict first-year college GPAs differentially between student subpopulation groups. The demographic variables for defining subgroups include gender, nationality, high school type, and high school major, and they served as the moderators in the regression analyses. Table 3 reports the moderation effect of each demographic variable. Figure 5 displays the regression lines showing the relationship between KUEAT scores and GPAs for each subgroup.
Next, the regression analysis with high school type as a moderator showed a significant interaction effect (β_3 = .012, p < .001), indicating that KUEAT scores predicted first-year college GPA differentially between students from public and private high schools. As shown in Fig. 5C, KUEAT scores negatively predicted college GPAs for students from public high schools (β_1 = −.002, p < .001) but positively predicted college GPAs for students who graduated from private high schools (β_1 + β_3 = .009, p < .001). Lastly, KUEAT scores significantly predicted first-year GPAs for students in humanities majors (β_1 = −.002, p < .001) but not for students in science majors (β_1 + β_3 = .000, p = .937), as shown in Fig. 5 (panel D). The moderation effect of high school major was significant (β_3 = −.002, p < .01), indicating differential prediction of KUEAT scores on first-year college GPA between students in different majors.

Discussion
This study examined the fairness of KUEAT scores as an example of a high-stakes college entrance exam in Kuwait. It is one of the first attempts to comprehensively evaluate the fairness of high-stakes standardized testing in Kuwait. The fairness of KUEAT scores is evaluated through two aspects: (a) internal evidence of bias at the item level and (b) external evidence of bias at the test level. Our analysis results identified a number of DIF items in the KUEAT and showed that KUEAT scores predicted college performance differentially for different subpopulation groups.
We first performed an item analysis based on Rasch measurement theory to support the validity, reliability, and fairness arguments. The reliability of KUEAT scores was very high, indicating good consistency and replicability of the test items. Many items showed misfit to the constructed scale: 12 misfitting items based on Infit mean square statistics and 40 based on Outfit mean square statistics. In addition, the DIF analyses revealed 38 out of 85 items, representing 45% of the entire item set, that may be biased against a particular student group. These pieces of evidence leave the validity and fairness arguments unwarranted and call attention to the valid and fair use of KUEAT scores for making admission decisions.
The presence of DIF items can seriously affect the validity and comparability of the scores for intended uses and interpretations (AERA, APA and NCME, 2014). It is essential to note that statistical bias alone is insufficient to conclude a lack of fairness. Researchers have investigated different reasons that may cause DIF, such as poorly formatted items, improper item content, or measurement of an irrelevant construct (Gafni, 1991; O'Neill and McPeek, 1993; Pae, 2004). When the irrelevant construct is associated with group membership, the scores would work against a particular student subpopulation (Zieky, 2016). The presence of DIF may also be related to latent group membership, e.g., students who speeded through a test (Cohen & Bolt, 2005; De Ayala et al., 2002). The existence of DIF may imply that the test measures an additional dimension; e.g., an integrated writing assessment may measure both reading and writing proficiency (Mazor et al., 1998; Roussos & Stout, 1996). For multiple-choice items, DIF may arise because distractors are perceived differently by individuals from different subgroups (Suh & Bolt, 2011; Suh & Talley, 2015). For test developers and users, it is important to investigate the potential reasons for DIF and revise or remove the DIF items before operational use of the test.
The differential prediction analysis revealed external biases of KUEAT scores. The KUEAT scores predicted first-year college GPAs differentially between student subgroups. Specifically, there was a positive relationship (i.e., the higher the KUEAT score, the higher the GPA) for examinees who were non-Kuwaiti, female, graduated from private high schools, or in humanities majors. For Kuwaiti examinees and those who graduated from public high schools, KUEAT scores negatively predicted their college GPAs. For male students and those in science majors, college GPAs were not significantly related to their KUEAT scores. A test should not be used for any purpose if it predicts future performance differentially between subgroups (Meade & Fetzer, 2009). This differential prediction calls the fairness of the scores into question and also renders the predictive validity of the test scores unwarranted.
The existing literature revealed contradictory results in terms of the predictive validity of KUEAT scores (Eid, 2009; Shamsaldeen, 2019). In particular, Eid (2009) indicated that KUEAT scores did not predict college performance. However, Shamsaldeen (2019) found that KUEAT scores significantly predicted second-year college GPAs. Our study found that KUEAT scores predicted college performance differentially across student subgroups. The evidence of predictive validity supports an overall or average relationship between KUEAT scores and college performance. When KUEAT scores predict GPAs differentially, an average relationship cannot represent all student subgroups. We therefore suggest first examining whether a test shows differential prediction. When KUEAT scores predict outcomes for all student subpopulation groups in the same manner, supporting the fairness argument, it is more reasonable and meaningful to discuss the predictive validity of the scores.
The differential prediction of KUEAT scores may be due to the differential validity of test scores (Meade & Fetzer, 2009; Sackett & Wilk, 1994), the presence of DIF items, measurement issues with criterion variables (Berry, 2015), or contextual influences such as stereotype threat (Steele & Aronson, 1995). Mattern et al. (2017) found that the inclusion of academic discipline in college affected the results of differential prediction analysis. The relationship between entrance test scores and GPAs has been found to vary across academic disciplines in college (Light et al., 1987). Different majors in college require different levels of English proficiency. In some majors, students can be successful regardless of their English level, which may weaken the association between language proficiency and academic achievement (Cotton & Conrow, 1998). Policymakers and administrators may consider assigning different weights to English test scores in admission decisions for different college majors.
Lastly, there are two limitations in this study. First, there was a restricted-range issue with first-year college GPAs, which may lead to underestimating the regression coefficients and the effect sizes between the predictor and the criterion variables. However, correcting for range restriction requires information about population parameters or estimates from similar studies; as neither is available, we could not conduct the correction. Second, since the test items remain confidential and were not provided to us, we could not perform a substantive review of the item content to further explore the causes of DIF. Content experts should review the DIF items to identify possible sources of DIF and provide "explainable sources of bias" for removing or revising any DIF item (Pedrajita, 2011). Similarly, since we do not have access to individual student records, a further examination of differential prediction could not be conducted. We suggest that test developers and users consider fairness issues and explore factors that may bias item scores and test scores against a particular subpopulation group.

Conclusion
This study examined the internal and external biases of KUEAT scores to support the fairness argument. Results indicated differential item functioning at the item level and differential prediction of college GPAs at the test level. Although KUEAT scores established high reliability, the validity and fairness of the scores were not supported. We recommend a substantive review of the test items and establishing a routine examination of psychometric quality based on modern measurement theory. A well-developed standardized test can be beneficial as a complement to high school GPAs for making college admission decisions.

Fig. 1
Fig. 1 Internal and external evidence of psychometric quality

Fig. 3
Fig. 3 Frequency distribution of Outfit and Infit mean square statistics. Note. MNSQ, mean square statistic. The red lines indicate the acceptable fit range between 0.8 and 1.2

Fig. 4
Fig. 4 Flagged Items for differential item functioning by high school type

Table 1
Demographic characteristics of data samples

Table 2
Differential item functioning items by nationality, gender, and high school major

Table 3
Moderated regression models with demographic variables. KUEAT, Kuwait University English Aptitude Test; β_3, the estimated regression coefficient for the interaction effect in each regression model; Std. err, standard error for β_3; 95% CI, 95% confidence interval for β_3