A novel technique for detecting DIF in items using z-scores based on an IRT model

A simple technique based on z-scores is proposed and derived for the Rasch model, and compared with traditional DIF detection methods. The data set used for application is the State Level Achievement Survey data set from West Bengal, India, comprising binary responses of 12518 examinees on a 40-item test.


Introduction
An item shows differential item functioning (DIF) if the item's psychometric behaviour varies across groups. DIF refers to the phenomenon where the estimates of item characteristics differ for some predefined groups (not intended by the test), viz. focal and reference, among test takers of similar ability level. These groups may differ by the examinees' geographic region, socio-economic status, demographic profile or any other attribute. To compare the item parameter estimates of the groups, it is essential to match the underlying latent ability scales of the focal and reference groups. For example, an item showing DIF in the difficulty parameter may appear easier to one group of individuals but difficult for the other, even when the groups are matched on the ability continuum. In other words, in the context of item response theory, an item is detected with DIF if the probability of giving a correct answer, estimated by the item response function of an IRT model, differs between groups of examinees with equal ability.
To assess the differential behaviour of the items, a test is administered to two groups, viz. the focal and the reference group. The focal group represents the minority of the population, where the focus of interest of the study lies; generally it represents the less advantaged portion of the population.
A reference group, the majority of the population, is compared with the focal group for identifying differential item functioning in a test situation (Benito et al., 2010). If the nature of the difference in item characteristics (and subsequently the probability of correct response) between the focal and reference groups is similar throughout the ability continuum, the differential item functioning is called 'uniform' DIF. On the contrary, when an item shows a different nature of DIF in, say, the lower and higher ability groups, it is identified as an item having 'non-uniform' DIF.
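The distinction between uniform and non-uniform DIF can be illustrated by comparing item response functions. The sketch below uses a two-parameter logistic form (with a discrimination parameter a) purely to produce a non-uniform example; all parameter values and function names are hypothetical, not taken from the study:

```python
import numpy as np

def irf(theta, b, a=1.0):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: the focal group faces a uniformly harder item (b shifted),
# so the reference-minus-focal gap keeps the same sign at every ability level.
gap_uniform = irf(theta, b=0.0) - irf(theta, b=0.5)

# Non-uniform DIF: the groups differ in discrimination (a), so the gap
# changes sign across the ability continuum.
gap_nonuniform = irf(theta, b=0.0, a=1.5) - irf(theta, b=0.0, a=0.8)

print(np.sign(gap_uniform))     # same sign throughout
print(np.sign(gap_nonuniform))  # sign flips around theta = 0
```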
Several methods for detecting DIF in items exist in the literature (Dorans & Kulick, 1986; Shealy & Stout, 1993; Holland & Thayer, 1988; Swaminathan & Rogers, 1990). One of the most commonly used is the Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988), and many extensions of this procedure have appeared in recent years (Parshall & Miller, 1995; Penfield, 2001; Zwick et al., 1997, 2000). There are also several methods for detection of DIF in unidimensional and multidimensional IRT models (Magis, Tuerlinckx & Boeck, 2015; Svetina & Rutkowski, 2014; Wiberg, 2007). Tay, Huang & Vermunt (2015) discussed the item response theory with covariates (IRT-C) method for identifying DIF in the three-parameter logistic IRT model; the advantage of this approach is that it estimates DIF across multiple variables simultaneously in large-scale testing with large sample sizes and long test lengths. Camilli (1992) described DIF as a shift in the ability distribution of examinees along a secondary trait (such as test-wiseness), rather than as a difference in the item parameters of two groups, and provided a conceptual analysis of DIF (as a characteristic of an examinee, not an item) with respect to a multidimensional IRT model. Ozdemir (2015) used Lord's chi-square, Raju's area (Raju, 1988) and likelihood-ratio test methods to detect DIF in TIMSS 2011 mathematics subsets and compared the results. Pedrajita (2015) demonstrated DIF analysis using contingency table approaches and carried out a comparative analysis of the results from different methods. DeMars (2015) conceptualized DIF as a secondary trait and conducted a simulation study using both unidimensional and multidimensional models. In another study, Latifi et al. (2016) explored differential item functioning and differential bundle functioning (DBF) using two independent procedures, in light of a data set on national Secondary School Certificate (SSC) examinations in Pakistan.
The present study proposes a novel approach for detection of DIF in items based on item response theory. The DIF statistic proposed in the study, based on the z-score, is easy to calculate and interpret.
The methodology is described in the next section. To assess the performance of the novel technique, a detailed simulation study is carried out under different simulating conditions, and the power and type-I error of the z-score method of DIF detection are analysed. The method is then applied to a large-scale assessment survey data set, and the results are compared with two popular methods of DIF detection, viz. the Mantel-Haenszel procedure (Mantel and Haenszel, 1959) and Lord's χ² statistic (Lord, 1980).

Methodology
This study adopts a simple method for detecting differential item functioning in items based on z-scores. The methodology is described with a one-parameter item response theory model, namely the Rasch model (Rasch, 1960, 1961); however, the method can easily be extended to two-parameter or three-parameter models. Using the notation of the Rasch model used by Reckase (2009), the probability of a correct response to an item i, for an examinee with ability θ_j, is given by

P(u_ij = 1 | θ_j, b_i) = exp(θ_j - b_i) / (1 + exp(θ_j - b_i)),

where u_ij is the binary (correct or incorrect, i.e. 1 or 0) score of individual j on item i, θ_j represents the ability level, a relevant characteristic of test-taker j required to respond to the items, and b_i is a measure of difficulty, an item characteristic.
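The Rasch item response function above can be written as a small sketch (function and variable names are ours, not the study's):

```python
import numpy as np

def rasch_probability(theta, b):
    """Rasch-model probability of a correct response,
    P(u_ij = 1 | theta_j, b_i) = exp(theta_j - b_i) / (1 + exp(theta_j - b_i)),
    for an examinee of ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# When ability equals difficulty, the probability of a correct answer is 0.5;
# it increases with ability and decreases with difficulty.
print(rasch_probability(theta=1.2, b=1.2))  # 0.5
```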
An item is identified as having DIF when examinees with the same latent ability but from different observed subgroups have unequal probabilities of giving a correct response. The observed subgroups (i.e. focal and reference) may be defined by socio-economic profile, geographic region, gender of examinees, or any other characteristic, depending on the context of a study. An efficient DIF statistic should measure the difference (if any) in item parameters between these groups at the same ability level. A novel approach for detecting DIF in items based on z-scores of the item characteristics estimated by the Rasch model is adopted in this study, which is easy to calculate and interpret. To obtain the statistic, the absolute difference of the Rasch difficulty parameter in the focal and reference groups (b_if and b_ir respectively) is calculated for each item i as

d_i = |b_if - b_ir|,

and the DIF statistic is

Z_i = (d_i - Mean(d)) / SD(d),

where Mean(d) and SD(d) are the mean and standard deviation of the differences d_i obtained from all items in the test. The larger the value of the z-score, the larger the extent of DIF in the item. The Z-statistic approximately follows a standard normal distribution (by the central limit theorem).
The test of significance for the null hypothesis that there is no difference in item parameters between the focal and reference groups, i.e. that DIF is not present in an item, is carried out at the 5% level of significance. The differences in the item parameter estimates for the focal and reference groups are obtained for all items in the test. If the null hypothesis is rejected, the corresponding item is considered to have DIF.
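The computation of the statistic and the flagging rule can be sketched as follows. This is an illustration under our own naming: the difficulty estimates would come from fitting the Rasch model separately to the focal and reference groups, the one-sided 5% critical value 1.645 reflects that large positive z-scores indicate DIF, and the 10-item difficulty values are hypothetical:

```python
import numpy as np

def zscore_dif(b_focal, b_ref, crit=1.645):
    """z-score DIF statistic: standardise the absolute difficulty differences
    d_i = |b_if - b_ir| over all items and flag items whose z-score exceeds
    the one-sided 5% critical value of the standard normal distribution."""
    d = np.abs(np.asarray(b_focal) - np.asarray(b_ref))
    z = (d - d.mean()) / d.std(ddof=1)
    return z, z > crit

# Hypothetical difficulty estimates for a 10-item test: item 1 (index 0)
# is much harder for the focal group and should be flagged.
b_ref = np.zeros(10)
b_focal = np.zeros(10)
b_focal[0] = 2.0
z, flagged = zscore_dif(b_focal, b_ref)
print(np.where(flagged)[0])  # [0]
```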

Simulation study
The simulation study is conducted to compare the true positive and false positive rates of the proposed DIF detection procedure in different situations. The true positive rate (i.e. power) is defined as the percentage of correctly identified DIF items in a pool of items, and the false positive rate (i.e. type-I error) as the percentage of incorrect identifications. Person ability parameters (θ) are generated from a Uniform(-2, 2) distribution and item parameters from a Normal(1.22, 0.7) distribution. The Rasch model is used to simulate item response data using the person and item parameters specified above. The data are generated with no group ability difference between the focal and reference groups. A matrix of binary (correct or incorrect) item responses is generated for the following conditions: 1. Sample sizes of focal and reference groups: a) 1000 and 1000, b) 1000 and 1500, c) 1500 and 1000.
2. Number of items in the test: 30 and 60.
3. Percentage of DIF items in the test: 10% and 20%.
To control for random differences in difficulty estimates over the focal and reference groups, the estimation process is replicated 100 times, and the power and type-I error are calculated for each condition across all replications. The level of significance for testing the hypothesis of no DIF is specified as 0.05. It is evident from the results of the simulation study presented in table 3.1 that the three conditions of simulation have different impacts on the power and type-I error rates of the procedure. An increase in sample size results in increased true positive rates. A lengthier question booklet has more power only in the case of unequal sizes of the focal and reference groups. The percentage of items having differential item functioning in the test has a significant impact on the power and type-I error of the DIF detection procedure: the true positive rates are higher when the number of differentially functioning items is large (increased from 10% to 20%), irrespective of the levels of the other conditions, viz. sample size and test length.
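One replication of the simulation design described above might look like the following sketch. The DIF magnitude of 0.6 added to the focal-group difficulties is our own hypothetical choice (the study does not state the size of the simulated DIF effect), and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(2015)

def simulate_rasch(theta, b):
    """Binary (examinees x items) response matrix under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

n_focal, n_ref, n_items = 1000, 1000, 30       # condition 1a with a 30-item test
theta_f = rng.uniform(-2, 2, n_focal)          # no group ability difference:
theta_r = rng.uniform(-2, 2, n_ref)            # both groups drawn from Uniform(-2, 2)
b = rng.normal(1.22, 0.7, n_items)             # item difficulties, Normal(1.22, 0.7)
b_focal = b.copy()
b_focal[:3] += 0.6                             # 10% of items carry a (hypothetical) DIF shift
U_focal = simulate_rasch(theta_f, b_focal)
U_ref = simulate_rasch(theta_r, b)
print(U_focal.shape, U_ref.shape)              # (1000, 30) (1000, 30)
```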

Results and Discussion
The present study describes a unique approach for detecting differential item functioning in a pool of items or a test. The novel technique for identification of items having DIF using z-scores presented in this study is an IRT-based DIF detection method. The methodology is simple to calculate and easy to interpret. It is observed (from table 3.1) that the proportion of items correctly identified as having DIF increases when the number of DIF items in a test is increased. For a 60-item test, the type-I error, or proportion of false positives (i.e. when an unbiased item is wrongly identified as an item having DIF), ranges between 5% and 7%. The maximum power achieved is 95%, for a 60-item test with 12 DIF items and a sample size of 2500. To eliminate random group differences, the z-scores are obtained for each cycle of the simulating conditions, replicated 100 times. The test of significance, with the null hypothesis of 'no group differences', is performed at the 5% level of significance. No difference in the ability (person characteristic) distribution between the focal and reference groups is assumed for the simulation study.
The z-score method of DIF detection is further evaluated using a data set from a state-level large-scale achievement survey carried out in West Bengal, India, in 2015. The data set contains binary responses of 12518 examinees on a 40-item test (set A only) on Science and Environment. Two pairs of focal and reference groups are identified for the analysis: one based on the gender of the examinee and the other on the geographic location of the examinee. Before the DIF analysis, a comparative analysis of the total score obtained by each examinee in the test is performed. To examine whether the focal and reference groups have any observed differences in examinee performance in terms of the total score, a t-test is performed with the null hypothesis that there is no group difference in total score between male and female, or rural and urban, examinees. A significant group difference is observed in the gender-wise classification of test-takers; however, there is not enough evidence of such a difference between rural and urban test-takers. The DIF analysis is performed to detect any group difference between male and female, or rural and urban, examinees in test performance, and it identifies the items which have significantly different characteristics (in terms of difficulty) in different groups of examinees, controlling for ability level. The items identified by the z-score method are all included in the sets of items flagged as having DIF by the two standard procedures, Mantel-Haenszel (MH) and Lord's χ². Hence the data analysis achieves 100% sensitivity of the z-score method, keeping the two standard methods as the gold standard. It can also be observed that the items which show a higher magnitude of DIF according to the MH and χ² methods are identified as DIF items by the proposed z-score method; the efficacy of the method for identifying low and moderate DIF items may also be investigated in future. The efficacy of the z-score method depends on the item response theory model chosen for calibrating the item and person characteristics in a test situation. An exploratory analysis of IRT models may be carried out to identify the best-suited model in a particular situation, and the DIF analysis using z-scores may be performed thereafter to identify items behaving differentially over the focal and reference groups. In this study, a confirmatory approach using the Rasch model, a one-parameter IRT model, is adopted to investigate and compare the performance of the DIF statistic in different test situations. The analysis may be extended further to IRT models with more parameters. The distribution of the test statistic is asymptotically normal (by the central limit theorem), and the method is efficient for large enough sample sizes. The performance of the technique with small sample sizes is also an area for future exploration.

Figure 4.1: Detection of DIF in SLAS 2015 data set (gender-wise classification of focal and reference group) by Mantel-Haenszel procedure

Figure 4.2: Detection of DIF in SLAS 2015 data set (location-wise classification of focal and reference group) by Mantel-Haenszel procedure

Figure 4.3: Detection of DIF in SLAS 2015 data set (gender-wise classification of focal and reference group) by Lord's χ 2 procedure

Figure 4.4: Detection of DIF in SLAS 2015 data set (location-wise classification of focal and reference group) by Lord's χ² procedure

For the sake of simplicity, a Rasch model (i.e. a one-parameter IRT model) is used for obtaining the values of the DIF statistic proposed in this study; however, it can be extended to higher-order models as well. To assess the performance of the novel technique, a simulation study and a data analysis are carried out. Data sets of sizes 2000 and 2500 are generated using specified distributions of the item and person parameters. The simulation is performed under different conditions based on the number of items in the test, the sample size of examinees, and the proportion of DIF items in the pool of items.

Figure 5.1: Items identified as having significant DIF using z-score statistic in gender-wise classification of SLAS 2015 data set

Figure 5.2: Items identified as having significant DIF using z-score statistic in location-wise classification of SLAS 2015 data set

Table 3.1: Power and type-I error rates of detection of DIF using z-score statistic in different conditions of the simulation study, with no group ability difference

An empirical analysis of the DIF detection procedure using z-scores on the State Level Achievement Survey (SLAS), "Utkarsha Abhijan" (UA), conducted in West Bengal, India, for monitoring improvement in the learning levels of children enrolled in the Government education system, can be found in Dey & Dey (2019). The present study uses this data set for detection of items having DIF by the z-score in pre-defined focal and reference groups. A report for UA 2015 was prepared by "Paschim Banga Sarva Shiksha Mission" (PBSSM) and UNICEF Kolkata, West Bengal. UA 2015 was conducted on state-specific curriculum and syllabus. This achievement survey included Language and Mathematics for Grade III, and Language, Mathematics and Environment & Science (EVS) for Grade VII. The report prepared by the authorities for UA 2015 in respect of EVS mentioned that standardised test tools were designed in two sets (i.e. Set A and Set B) of 40 items by randomly shuffling the order of the items. The tests were administered across 20 districts of West Bengal in the sampled Government/Government-Aided Bengali Medium schools. This section presents a DIF analysis of the 12518 examinees in Grade VII who took the test on Environment & Science (EVS) in question set A with 40 items. The examinees are classified into two pairs of focal and reference groups, viz. according to Gender (Male or Female) and Location (Urban or Rural). In the gender-wise classification, the focal group consists of the male examinees (sample size = 5619) and the reference group consists of the female examinees (sample size = 6899). In the location-wise classification, the examinees from rural areas belong to the focal group (sample size = 10249) and the examinees from urban areas belong to the reference group (sample size = 2269). The total scores are calculated as the aggregated score of all examinees in the two classifications according to gender and location. The descriptive measures of the score distribution in the different classes of examinees are given in table 4.1. The group differences in mean scores of examinees in the focal and reference groups are tested with a one-sample t-test (alternative hypothesis: the group difference is non-null) at the 5% level of significance. The z-score statistic, used in this section for detecting DIF in items, is based on the Rasch model, which has one item parameter, viz. difficulty.

Table 4.1: Descriptive measures of total scores of examinees in focal and reference groups

DIF in items is also detected by the older techniques of Mantel-Haenszel and Lord's χ², and the results are compared with those obtained from the z-score statistic. Figures 4.1-4.4 show the items which are flagged as having DIF (in red) by these two procedures, in the two classifications. The z-score statistic (DIF.z, as in figures 5.1 and 5.2) is used to detect items having DIF in the SLAS 2015 test booklet. The z-scores of each item are plotted in figure 5.1 for the gender-wise classification and in figure 5.2 for the location-wise classification, with the differentially functioning items plotted in red. Three items (2, 15 and 19) are identified as having differential characteristics between the focal and reference groups according to gender, while two items (12 and 40) are identified as having differential characteristics according to location. The content areas of these items are given in table 6.1. The difficulty estimates of these items, as obtained from the Rasch model, are -0.35, 1.23, 0.76, 0.64 and -0.65 for items 2, 15, 19, 12 and 40 respectively (figure 2.1). The difficulty levels of the flagged items are low to moderate, except for item 40. Thus, no matter how easy an item is, DIF may occur in it for pre-defined focal and reference groups.

Table 6.1: Classification of items according to content area and discipline

One of the flagged items asks examinees to identify a part of a plant on the basis of its properties, viz. it enters the earth from the seed, there are no nodes in it, it has a cover at one end and thread-like structures at the other.