Differential Item Functioning: Item-Level Analysis of TIMSS Mathematics Test Items Using Australian and Indonesian Data

The Trends in International Mathematics and Science Study (TIMSS) aims to provide a broad perspective for evaluating and improving education. The assessment also ranks the participating countries by performance and draws inferences about factors affecting achievement and learning. However, the study may not function as expected because of curricular, cultural, or language differences among countries, which challenges assumptions of measurement equivalence. The present study assesses the equivalence of mathematics items in the TIMSS (2007) study across Australia and Indonesia. Students' responses were subjected to Rasch analysis to identify DIF items. The results revealed that many items in the mathematics tests are problematic because they showed significant bias. The study also found that Australian students performed better and found the mathematics items easier than their Indonesian counterparts did. Several factors, such as curricular differences, the methods used to solve mathematics problems, the availability of textbooks, and teacher quality, might explain the presence of DIF between the countries. These findings indicate serious limitations in using TIMSS results to compare the performance of students across countries. Thus, further empirical evidence is needed before the TIMSS 2007 results can be meaningfully used in research.


Introduction
One of the major developments in mathematics education is the growing interest in international comparisons of student achievement. International comparative studies, such as the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA), were implemented decades ago. TIMSS is an ambitious series of international assessments conducted in nearly 60 countries to measure trends in learning mathematics and science (IEA, 2008). This cross-cultural study has been conducted since the 1960s, based on the idea that such an assessment can provide a broad perspective for evaluating and improving education. In addition, participating countries can gauge their relative positions in mathematics achievement against other countries. Analyzing the data collected from this large-scale comparative study of mathematics achievement may enable us to understand educational processes and to identify new issues relevant to reform movements in educational systems. Moreover, analysis within and across countries may reveal the links among students' achievement, teachers' instructional practice, and curriculum content. This information can then be used to guide educational decision-making and practice in the area of mathematics (IEA, 2008).
However, to meet the objectives stated above, international studies clearly need to confirm the validity and reliability of their tests (Wu, 2009). This is urgent because international studies such as TIMSS originally use test instruments in English, which are then translated into the students' language of instruction. Many researchers have argued that adapted tests must possess adequate validity and reliability within each language in order to support valid comparisons across groups of students (Sireci & Gonzales, 2003; Yildirim, 2006; Chen, Gorin, Thomson, & Tatsuoka, 2008; Wu, 2009). The present study on test adaptation addresses this need.
Regarding test adaptation, the TIMSS (2007) study administered tests in 39 different languages in 59 participating countries. Although TIMSS (2007) implemented rigorous translation verification to achieve maximal linguistic equivalence and to produce test items that are simple and context free (IEA, 2008), the test instruments may not function in the same way in all cultures because of curricular, cultural, or language differences among the countries (Sireci & Gonzales, 2003; Ercikan & Koh, 2005; Schulz & Fraillon, 2009; Yildirim, 2006; Arim & Ercikan, 2014). Consequently, this international test may not function as expected, and it may not be equivalent or fair across different cultures. According to Gierl (2000: 281), 'if the construct measured by the two forms is not equivalent, it may change the validity for one set of test scores and adversely influence their comparability, meaning, and interpretability'. Hence, the validity of the scores of any translated achievement test depends on the accuracy of the test adaptation, indicating that test equivalence must be evaluated to achieve valid test adaptation.
The issues of validity and reliability can be viewed from multidimensional perspectives. In international assessment, different groups of participants may have differently distributed multidimensional abilities because of differences in language, culture, and curriculum (Ercikan, 1998; Byrne, 2002; Arim & Ercikan, 2014). These differences may cause a test item to function differently between two groups. It has been argued that when test items exhibit Differential Item Functioning (DIF), the validity and reliability of the test are not yet established (Wu, 2009; Arim & Ercikan, 2014), which may affect the equivalence or non-equivalence of the test items. Therefore, the investigation of DIF is required to assure the validity and reliability of the assessment.
Many international comparative studies have been conducted to determine the existence of DIF. For example, Ercikan (1999) reported that 41% of science items from TIMSS displayed moderate or large DIF when Canadian English and French examinees were compared; she also found that 18% of mathematics test items exhibited DIF. Allalouf, Hambleton, and Sireci (1999) found that 42 of 125 verbal items (34%) displayed moderate or large DIF on the Israeli Psychometric Entrance Test when Hebrew and Russian examinees were compared. Yildirim (2006) assessed the Turkish and English versions of TIMSS 1999 and found that the rate of DIF items within the test was high and that differential discrimination was an issue. Arim and Ercikan (2014) also found that approximately 23% of mathematics items in the TIMSS (1999) study functioned differentially across the American and Turkish versions. However, few studies have focused on Australian and Indonesian data. Such studies are urgently needed because the tests administered in the two countries were written in different languages: Australian students were tested in the source language of English, whereas Indonesian students were tested in a Bahasa Indonesia version adapted from the source language. DIF-related problems may arise during the process of test translation and adaptation between the languages of the two groups, so the equivalence of the English and Bahasa Indonesia versions should be investigated in a context in which the effects of cultural differences can be minimized. In addition, the performance of eighth-grade students in both countries is below the international average (500), and DIF analysis may provide information about the difficulty of the test items faced by students in both countries. Therefore, the aim of this study is to conduct an item-level analysis in which the test items are investigated using the DIF method.
Because valid and reliable assessments are not easy to develop (Wu, 2010), the main purpose of this study is to examine the equivalence of mathematics items in TIMSS (2007) across cultures and languages. This study also provides an overview of statistical methods that can be employed to assess flaws in items caused by test translation in the context of mathematics achievement testing. Several DIF methods seek evidence of differential performance across subgroups in order to detect bias. These include item response theory with Rasch model analysis (Hungi, 2005); item response theory with likelihood ratio analysis (IRT-LR) (Yildirim, 2006); and the Mantel-Haenszel (M-H) technique (Yildirim, 2006; Gierl & Khaliq, 2001). However, this study employed only item response theory with Rasch model analysis; the reason for this choice is explained in the methods section of this paper.
The current study addresses the following research question: "Do the mathematics items of TIMSS (2007) operate differently between Australian and Indonesian students?" For this purpose, the study assesses responses to the Indonesian and Australian TIMSS (2007) mathematics items with respect to the psychometric characteristics of the items. Because this study evaluates the possible presence of item bias caused by test translation, the results of such analyses should provide information useful in understanding how differences in items may relate to educational differences across countries. In short, the results of these analyses might provide some insight into the reasonableness of the assumption that TIMSS (2007) mathematics items are equivalent and fair across countries. Based on previous research on test adaptation and test translation within international comparative studies and the appearance of DIF during that process, it is hypothesized that the mathematics test items administered to Australian and Indonesian students may function differently.

Methods
This study used the TIMSS (2007) mathematics achievement test. The dataset is publicly available on the International Association for the Evaluation of Educational Achievement (IEA) website. The test consists of numerous items designed to collect information about students' mathematical ability: 63 number items, 64 algebra items, 47 geometry items, and 41 data and chance items, for a total of 215 items. The topics under the four content areas were as follows. The number area includes whole numbers, fractions and decimals, integers, ratios, proportions, and percentages. The algebra area includes patterns, algebraic expressions, equations and formulae, and functions. The geometry area includes three topics: geometric shapes, geometric measurement, and location and movement. Finally, the data and chance area includes data organization and representation, data interpretation, and chance. All aspects of the test content represent the subject matter of school mathematics covered by the eighth-grade curriculum in both Australia and Indonesia.
Of the 215 items, 81 were classified as measuring knowledge, 88 as measuring application, and 46 as measuring reasoning skills. More than half of the items (117) were multiple-choice, and the rest (98) were constructed-response (CR) items that required students to generate and write their own answers. These mathematics items were then matrix sampled into fourteen booklets: the pool of items was divided into 28 sets of items, or clusters, which were arranged to make 14 overlapping test booklets that were distributed systematically in each classroom. Each examinee was administered one of the 14 test booklets.
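To make the matrix-sampling design concrete, the following minimal Python sketch shows one way that 28 clusters could be rotated into 14 overlapping booklets so that adjacent booklets share clusters and all booklets are linked onto a common scale. The rotation rule here is purely illustrative and is not the actual TIMSS (2007) booklet design.

```python
# A simplified illustration of matrix sampling: 28 item clusters are
# rotated into 14 overlapping booklets. The rotation rule below is
# hypothetical, not the actual TIMSS 2007 assembly plan.
n_clusters, n_booklets, per_booklet = 28, 14, 4

booklets = [
    [(2 * i + k) % n_clusters for k in range(per_booklet)]
    for i in range(n_booklets)
]

for i, clusters in enumerate(booklets, start=1):
    # Each booklet shares two clusters with its neighbour, linking
    # all booklets onto a common measurement scale.
    print(f"Booklet {i:2d}: clusters {clusters}")
```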
The present study investigates two booklets: Booklet 8 and Booklet 9. These booklets were selected because they contain more test items than the other booklets do, so more items could be investigated. The number of TIMSS (2007) mathematics items by type and reporting category in these booklets is given in Table 1. Because some constructed-response items were worth more than one score point, the number of possible score points available for the analysis exceeded the number of items; the total scores for Booklet 8 and Booklet 9 were 33 and 32, respectively.
For the purposes of this study, a total of 1,178 grade 8 students were included across the two booklets: 578 were Australian students, and 600 were Indonesian students.
The examinees were administered one of the two test booklets (Booklet 8 or Booklet 9). The Australian students were tested in the source language of English, whereas the Indonesian students were tested in the Bahasa Indonesia version that was adapted from the source language. The selection of these countries allowed for the investigation of the equivalence of English and Bahasa Indonesia versions when cultural differences were expected to be minimal.
Because many countries, cultures, and language backgrounds were involved in the TIMSS (2007) study, test adaptation plays an important role. Hence, TIMSS (2007) followed strict verification procedures to ensure translation equivalence. These procedures were also used to minimize semantic, psychometric, and linguistic differences between the source and translated language versions of the test. TIMSS (2007) instruments were developed in English and then translated into 39 other languages, following a complex verification procedure of translation and adaptation appropriate to the cultural contexts of the participating countries. Professional translators and subject-matter experts were involved in ensuring that the meaning and difficulty of items did not change between the source and target versions. Additionally, a series of statistical checks was conducted to detect differences in the performance of the items (IEA, 2008). A double-translation procedure was also used in TIMSS (2007) to ensure that the materials were equivalent across language versions.
Having described the data and the rationale for selecting the subgroups of items, we now define the statistical and judgmental procedures used in the analyses. Item response theory with the Rasch model approach was used in the DIF analyses of the items selected for this study. The Rasch model (Rasch, 1960) was used to determine the equivalence of the test items, particularly at the item level.
The justification for using this model is that Rasch modeling is widely used to measure invariance and determine equivalence across groups of items (Schulz & Fraillon, 2009). Additionally, the Rasch model proposes that responses to a set of items can be explained by a person's ability along a continuum of the unidimensional construct underlying the items and by the characteristics of the items, or item parameters. Several advantages of Rasch measurement have been described (Andrich, 1988; Wright, 1997). A key characteristic of the model is that Rasch measurement can be considered sample independent as well as instrument independent. That is, if a Rasch model fits a set of data, item characteristics are not dependent upon a specific sample; therefore, item parameters estimated across different groups and contexts will be equivalent (Andrich, 1988). Consequently, the Rasch model can be used to assess the extent to which a set of test items is sample- or context-free (Raczek et al., 1998). Rasch procedures also enable the test developer to examine the equivalence of item calibrations across different samples and contexts, including various cultural-linguistic settings and translations. In this case, Rasch analysis enables a more detailed (item-level) examination of the structure and operation of the scales on the tests.
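For reference, the dichotomous Rasch model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty: P(X = 1 | θ, b) = exp(θ − b) / (1 + exp(θ − b)). The short Python sketch below illustrates this function; it is included only for exposition and is not part of the ConQuest analysis used in this study.

```python
import numpy as np

def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch
    model, given person ability `theta` and item difficulty `b`
    (both on the same logit scale)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Example: a person 1 logit above the item's difficulty succeeds
# with probability of about 0.73.
print(rasch_probability(theta=0.5, b=-0.5))
```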
Within the Rasch model, DIF analysis was employed to investigate the items that operate differently across the Australian and Indonesian groups. To perform this analysis, the mathematics achievement data from the Australian and Indonesian student datasets were subjected to Rasch analysis using ConQuest 2.0 software (Wu, Adams, Wilson, & Haldane, 2007). Inspecting the infit mean squares (IMS) provides evidence of the fit of the data to the model; the infit mean square is used to determine the fit of each item within the construct. In this study, the critical values chosen for the IMS fit statistic were 0.72-1.30 (Linacre, Wright, Gustafsson, & Martin-Lof, 1994). Items whose IMS values fall above 1.30 are generally considered misfitting and do not discriminate well, while those below 0.72 are overfitting and provide redundant information (Tilahun, 2004). Additionally, various statistics and probability curves were used to judge the results. For instance, parameters were estimated separately for each group to determine whether the underlying model fit the data; if the given indicators are equivalent across groups, item bias is not supported (Little, 1997). In detecting biased items, the item threshold approach was also used. As suggested by Hungi (2005), the two criteria in this approach are as follows: (a) items whose difference in threshold (estimated difficulty) values between the two groups falls outside a predetermined range, |d₁ − d₂| > 0.50, where d₁ is the item's threshold value in group 1 and d₂ is the item's threshold value in group 2; and (b) items whose standardized difference in item thresholds between the groups falls outside a predefined range; Adams and Khoo (1993) employed the range -2.00 to 2.00, that is, |st(d₁ − d₂)| > 2.00.
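To show how these screening rules could be applied in code, the sketch below computes the infit mean square for one item from observed 0/1 responses and model-implied probabilities, and flags DIF using the two threshold criteria above. The standardized-difference formula (the raw difference divided by the pooled standard error of the two estimates) is our assumption based on common practice with Quest/ConQuest output; the paper does not state it explicitly.

```python
import numpy as np

def infit_mean_square(x, p):
    """Infit (information-weighted) mean square for one item.
    `x`: observed 0/1 responses; `p`: Rasch probabilities of a correct
    response for the same persons. Values above 1.30 suggest misfit;
    values below 0.72 suggest overfit (redundant information)."""
    x, p = np.asarray(x, float), np.asarray(p, float)
    w = p * (1.0 - p)                       # binomial information weights
    return np.sum((x - p) ** 2) / np.sum(w)

def flag_dif(d1, d2, se1, se2):
    """Apply the two item-threshold criteria: |d1 - d2| > 0.50 and a
    standardized difference outside +/-2.00. The pooled-standard-error
    denominator is an assumption, not stated in the paper."""
    diff = d1 - d2
    std_diff = diff / np.sqrt(se1 ** 2 + se2 ** 2)
    return abs(diff) > 0.50 and abs(std_diff) > 2.00

# Example with hypothetical estimates: easier for group 1 by 0.8 logits.
print(flag_dif(d1=-0.4, d2=0.4, se1=0.10, se2=0.12))  # True
```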

Results and Discussion
Descriptive summary. Because this study used secondary data, it is important to present descriptive statistics for the data. Table 2 shows the scale statistics for the selected booklets of TIMSS (2007). The results indicate that Australian students performed significantly better than the Indonesian students did in Booklet 8. Although the Australian students also performed better in Booklet 9, independent t-tests comparing the group means showed that the difference in that booklet was not significant. In addition, the score distributions for the Indonesian students were slightly more skewed (1.404 and 1.054) than those of the Australian students.
Country differences. DIF analysis was used to investigate the presence of item bias and significant differences between the Australian and Indonesian groups. The items in the two selected booklets were subjected to analysis. Two criteria, based on IMS values and on significant differences in threshold, were applied to identify biased items. Two separate analyses were conducted, and the results of each are presented below. Items whose IMS values fell outside the acceptable range in both groups did not fit the model of either the Australian or the Indonesian group; they were identified as bad items, indicating that their inclusion on the test should be reconsidered. Thus, based on the item IMS criterion, the results showed that country bias was a problem in the TIMSS (2007) mathematics tests.
Examining the items based on significant differences is also important in determining the existence of DIF within the groups. The results in Table 3 show that the Australian students generally performed better and found the items in Booklet 8 relatively easier than the Indonesian students did. These scales were derived from standardized mathematics scores (mean = 50, SD = 10). The results also showed a difference of 1.150 between the two groups' estimates; because this parameter estimate is more than twice its standard error, the difference is statistically significant (Wu et al., 2007). The significant variance within the items is shown in Table 3.
A negative value of the difference in item estimates (d₁ − d₂), as shown in Table 3, indicates that the item was relatively easier for the Australian students than for the Indonesian students, while a positive value implies the opposite. Using this criterion, the analysis found that most items in Booklet 8 apparently favored one group or the other. However, it is important to remember that a mere difference between the estimate values of an item for the Australian and Indonesian groups may not be sufficient evidence of bias for or against a particular group. Nevertheless, a difference in item estimates outside the ±0.50 range is large enough to raise concern, as is a standardized difference in item thresholds outside the ±2.00 range (Adams & Khoo, 1993; Hungi, 2005). Using these criteria, it is important to note that the standardized DIF for the last item could not be calculated: its standard error was not estimated because the last item's difficulty was constrained so that the average difficulty equaled 0. Therefore, the last item was judged only according to the difference between the groups (d₁ − d₂). The same applied to each country's DIF analysis of each booklet in this study.
From the above criteria, 20 items were identified as DIF items because they fell outside the predefined ranges (|d₁ − d₂| > 0.50 and |st(d₁ − d₂)| > 2.00). Ten items (m042060, m042066, m042019, m042243, m042229b, m042224, m032064, m032100, m032734, and m032132) were markedly easier for the Australian students than for the Indonesian students. On the other hand, ten items (m042023, m042080b, m042203, m042255, m032094, m032419, m032538, m032324, m032116, and m032402) were markedly easier for the Indonesian students than for the Australian students. These items are problematic because significant variance was found in them. The item characteristic curves (ICCs) for items favoring the Australian group are clearly higher for Australian students than for Indonesian students, which means that, at the same ability level, Australian students had a greater chance than Indonesian students of answering these items correctly. In contrast, the ICC for Indonesian students on item m042080b (Figure 3) was mostly higher than that of the Australian students. Based on this evidence, it can be concluded that country bias was an issue in Booklet 8.
Country differences in Booklet 9. The DIF analysis was also carried out for Booklet 9. The results of the analysis of the 31 items in this booklet, for the examinees in each group, are summarized in Tables 5 and 6. As Table 6 shows, three items fell outside the acceptable IMS range in both groups: m032662 (1.31; 1.64) and m042169b (1.35; 1.50) were misfitting, and m042198c (0.67; 0.64) was overfitting.
The IMS values in the Australian group also showed that three other items, m03232 (1.65), m042198a (0.63), and m042260 (0.70), fell outside the range (0.72-1.30). However, these items behaved well when the model was fitted to the Indonesian group: their IMS values, m03232 (1.05), m042198a (0.91), and m042260 (1.29), fell within the range, indicating that the items fit the model for the Indonesian group. Conversely, the analysis of the IMS values in the Indonesian group found that three items did not fit the model for this group but did fit the model for the Australian group: their IMS values in the Indonesian group, m032064 (0.59), m032477 (1.39), and m042300b (0.61), fell outside the predetermined range, while the Australian group recorded IMS values for m032064 (0.74), m032477 (0.86), and m042300b (0.89) within the range. These results indicate that these items are somewhat problematic. Thus, based on the IMS criterion, it is evident that there is country bias in Booklet 9.
The significance of DIF in the items was investigated using the threshold approach. Table 5 shows that 23 items in Booklet 9 exhibited significant DIF: the differences in their threshold values exceeded 0.50 in absolute value, and their standardized difference values exceeded 2.00 in absolute value. In addition, 10 of these items were biased in favor of the Australian students, as indicated by negative values of the difference in item thresholds, while 13 items were biased in favor of the Indonesian students, as indicated by positive values. These results indicate significant variance in these items, which is evidence of DIF. Thus, the results showed that most of the test items in Booklet 9 were biased in favor of one group or the other.
The large gap in performance between the students of the two countries is shown in the ICC plots of the items that exhibited significant DIF, illustrated in Figures 4 and 5. Figure 4 shows that, at a given ability level, the probability of success on this item is higher for Australian students than for Indonesian students, indicating that the Australian students found this item easier than the Indonesian students did.
In contrast, as shown in Figure 5, the probability of success on that item was higher for the Indonesian students than for the Australian students at the same ability level; the Indonesian students found this item easier than the Australian students did. Many items in Booklet 9 appear somewhat problematic. Therefore, it can be concluded that country bias was a concern in Booklet 9 of the TIMSS (2007) mathematics test.
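As an illustration of how such ICC contrasts arise under the Rasch model, the short sketch below plots the curves for one item under two group-specific thresholds. The numeric values are hypothetical, not the estimates for any actual TIMSS item.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-4, 4, 200)   # ability scale (logits)
b_group1, b_group2 = -0.6, 0.9    # hypothetical group-specific thresholds

# A lower threshold shifts the curve left: at every ability level the
# item is easier for that group, which is the visual signature of DIF.
plt.plot(theta, 1 / (1 + np.exp(-(theta - b_group1))), label="Group 1")
plt.plot(theta, 1 / (1 + np.exp(-(theta - b_group2))), label="Group 2")
plt.xlabel("Ability (logits)")
plt.ylabel("Probability of a correct response")
plt.title("ICCs for a hypothetical DIF item")
plt.legend()
plt.show()
```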
In this study, the large difference in ability between the Australian and Indonesian groups on the TIMSS 2007 mathematics tests could be explained by curricular differences. Although this study did not investigate the degree to which DIF may be caused by curricular differences, some evidence from the relative distribution of DIF items by content area in each booklet indicated that some DIF items were affected by curricular differences (Ercikan, 2002; Ercikan & Koh, 2005; Emenogu & Childs, 2005; Yildirim, 2006). These differences include the sequence of mathematics courses and the time spent on each topic, as well as teachers' classroom practice as influenced by their academic training, experience, and the materials available to them (Emenogu & Childs, 2005).
This problem might also exist in the Australian and Indonesian contexts because the mathematics curricula of the two countries differ. Therefore, further studies investigating bias must be carried out, as suggested by Yildirim, Yildirim, and Verhelst (2014), who argued that when DIF items are detected in a test instrument, researchers should conduct studies to determine the possible causes of the DIF detected in those items.
The relative failure of Indonesian students on most items of the TIMSS (2007), compared with the Australian students, could be attributed to the ineffectiveness of the curriculum and instructional practices in Indonesia, or to the limited textbooks and other resources available in most Indonesian schools to support student learning. This assumption is in line with the findings of studies that documented the teaching strategies used by Indonesian mathematics teachers as a factor contributing to this failure (Hadi, 2004; Widjaya & Heck, 2003; Zakaria, Solfitri, Daud, & Abidin, 2013).
Consequently, this may affect Indonesian students' performance on the constructed-response (CR) items of TIMSS (2007), which require students to communicate mathematically (providing explanations and reasoning), to compare various results, and to understand real-world contexts.
Another reason for the large difference in ability between the Australian and Indonesian groups on the TIMSS (2007) mathematics tests is low teacher qualification. A survey of teacher quality conducted by the World Bank (2005) showed that the preparation and attendance of Indonesian teachers are inadequate. Unlike many other countries, Indonesia allows graduates of all teacher-training institutes to become teachers without checking their preparedness to impart knowledge and skills under various school conditions. The survey also found that 20% of Indonesian teachers were absent at the time of random spot checks in a representative sample of schools. This finding is unfortunate because absenteeism can result in low-quality education, particularly low achievement in mathematics. Another study on teacher quality, conducted by Saito, Harun, Kuboki, and Tachibana (2006), revealed that mathematics teachers seldom pay attention to students' learning processes. Teachers still seem to conceive of a lesson only from the perspective of teaching models, such as the "chalk and talk," demonstration, and group-discussion approaches. This is evident in the dominant interest in teaching models, the lack of attention to detail in students' learning processes, and the failure to question the reasons for students' mistakes and misconceptions. In addition, teachers used most contact time to explain and solve mathematics problems, while students remained passive and simply copied what the teacher wrote on the blackboard.
However, it is possible that other factors, such as experience with similar tests or a lesser propensity to guess, contributed to a different test-taking approach. These possibilities merit further investigation to determine the reasons that DIF items exist.
The results of this study suggest that future research should investigate other areas. For example, it is important to determine the ways in which the results of item and item-analysis methods differ. The current study only conjectured that curriculum differences, instructional practices, and teacher quality were among the factors contributing to DIF items. Future research should attempt to investigate the sources of DIF using the same data so that appropriate interventions can be made to improve the quality of test design. This study was an initial step in assessing DIF items; problematic items identified by the statistical procedure could be examined more thoroughly to determine other potential sources not found in this study.
Future research could also use more than one DIF technique to assess TIMSS test items, so that the pattern of agreement among procedures may produce reliable, generalizable identification of DIF items. Yildirim (2006) suggested that using more than one method would lead to better understanding because multiple methodologies compensate for one another's defects.
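For instance, the Mantel-Haenszel technique cited earlier could serve as a second check on the Rasch-based flags. The sketch below is a minimal implementation of the M-H common odds ratio for a single dichotomous item, matching examinees on total score; it is offered as an illustration of the technique, not as the procedure used in this study, and the variable names are our own.

```python
import numpy as np

def mantel_haenszel_odds_ratio(scores, correct, group):
    """Mantel-Haenszel common odds ratio for one studied item.
    scores:  total test scores used as the matching variable
    correct: 0/1 responses to the studied item
    group:   0 = reference group, 1 = focal group
    A ratio far from 1.0 suggests DIF after matching on ability."""
    scores = np.asarray(scores)
    correct = np.asarray(correct)
    group = np.asarray(group)
    num = den = 0.0
    for s in np.unique(scores):          # one 2x2 table per score level
        m = scores == s
        a = np.sum((group[m] == 0) & (correct[m] == 1))  # ref, correct
        b = np.sum((group[m] == 0) & (correct[m] == 0))  # ref, incorrect
        c = np.sum((group[m] == 1) & (correct[m] == 1))  # focal, correct
        d = np.sum((group[m] == 1) & (correct[m] == 0))  # focal, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")
```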
In addition, this study found that many items in the TIMSS (2007) mathematics test recorded poor IMS values and exhibited item bias. However, it was difficult to establish the reasons they showed poor fit or bias. Therefore, replication studies or in-depth investigations should be carried out before decisions are made to eliminate items identified as poorly fitting or biased from future TIMSS mathematics tests.

Conclusions
The investigation of item bias using the DIF technique of the Rasch model indicated that country DIF was a problem in the mathematics test items. Using Australian and Indonesian data, the country DIF analyses identified about 75% of the items in each booklet as exhibiting significant bias: 20 items in Booklet 8 and 23 items in Booklet 9 were identified as biased. These items had differences in threshold values and standardized differences in item thresholds outside the predefined ranges, and many items were apparently biased in favor of one group or the other. Based on the results of the analyses conducted in this study, it was concluded that TIMSS (2007) contains many DIF items and that there was a large difference in ability between the two groups.
In addition, the country DIF analyses revealed that the Australian students generally performed better and found the items in each booklet relatively easier than the Indonesian students did. This DIF was consistently significant in both booklets used in the country DIF analyses. The differences in item performance observed in this study indicate serious limitations in using TIMSS results to make comparisons between students in Australia and Indonesia. Thus, further empirical evidence is needed before the results of TIMSS (2007) can be meaningfully used in research.