The Development of IRT Based Attitude Scale towards Educational Measurement Course

In this study, the Scale of Attitude towards Educational Measurement and Evaluation (SAEM), developed by Demirtaşlı (2002), was reconstructed based on polytomous Item Response Theory (IRT) models, and its psychometric features were identified. In this context, the polytomous IRT model that best fits the SAEM data was investigated. IRT models yield invariant person and item parameters when the data fit the model. A version of the SAEM consisting of 41 four-point Likert-type items was administered to 519 teacher candidates attending teacher education programs at several universities in Turkey. The data were analyzed according to three polytomous IRT models: Samejima's graded response model (S-GRM), the partial credit model (PCM) and the nominal response model (NRM). The results of the analysis showed that the new version of the SAEM, which is based on S-GRM, consists of 33 items and has a lower chi-square value than the other models; its classical internal consistency reliability was found to be 0.97. The findings of the study indicate that the validity and reliability features of the scale are fairly good.


INTRODUCTION
Educational measurement and evaluation is a compulsory course in undergraduate and teacher certification programs. Although teachers spend as much as a third of their professional time in assessment-related activities, and many of these activities require skills in testing and measurement (Wise, Lukin and Roos, 1991), some pre-service and in-service teachers have concerns about, and negative attitudes towards, succeeding in these math-based subjects (Brady & Bowd, 2005; Gresham, 2010; Jaggernouth, 2010; Kottke, 2000). As an affective trait, attitude is a tendency to respond by approaching or avoiding an object, person, institution, or event (Ajzen, 2005). This tendency can have an indirect positive or negative impact on learning behavior (Perkins, Adams, Pollock, Finkelstein & Wieman, 2005; Reed, Drijvers & Kirschner, 2010; Shih & Gamon, 2001). Several studies have investigated the attitudes of pre-service and in-service teachers towards the measurement and evaluation course and their self-efficacy in this course (Aktaş & Alıcı, 2012; Kılınç, 2011; Kilmen & Demirtaşlı, 2009; Ozan & Köse, 2013; Özbaşı & Demirtaşlı, 2013; Ulutaş, 2003). Knowledge of the attitudes of pre-service and in-service teachers towards the measurement and evaluation course can be used to create a more positive learning environment in education and training programs. Identifying and analyzing the negative attitudes of student teachers towards a course with a valid and reliable attitude scale helps to determine the pedagogic action needed to change teacher candidates' attitudes from negative to positive. This, in turn, contributes to establishing a positive learning climate.
In education and psychology, test construction is based primarily on two test theories: classical test theory (CTT) and item response theory (IRT). The theoretical foundations of IRT date back to the 1950s; however, since IRT-based estimations require complex mathematical and statistical processes, remarkable progress in this area was observed only after the 1980s, with significant innovations in computer and software technology. When applied to a dataset that meets its basic assumptions, IRT can overcome the limitations of CTT and provides several advantages for the scaling process. In scale-development studies based on IRT, when the basic assumptions of IRT are fulfilled and the data fit the model, invariant person and item parameters can be estimated (De Ayala, 2009; Hambleton et al., 1991). Therefore, IRT-based tests do not necessarily require conventional test norms, because the items measure in the same way across subsamples from the same population (Embretson & Reise, 2000, p. 25; Hambleton et al., 1999). An IRT-based scale can be used as a valid and reliable instrument to estimate the traits of subsamples. With this advantage, IRT can also be used to solve other measurement problems, such as those related to test equating, computerized adaptive testing and the detection of biased items.
In this context, the purpose of the IRT-based SAEM is to benefit from the advantages of IRT, such as invariant item and theta parameters when the data fit the model. By means of this invariance, no further norm studies are needed for the interpretation of SAEM scores or the comparison of groups. Besides, since IRT models give individual error estimates at the item and person level, the IRT-based SAEM will be able to measure attitudes towards the educational measurement course more reliably. In addition, possible item bias can be detected for several background variables of the participants, such as type of undergraduate program (social sciences, science) and level of attitude towards courses with numerical content. Finally, when the SAEM is developed based on IRT, parallel forms of the SAEM can be constructed more easily and reliably.

Purpose
In this study, IRT was used to reconstruct a Likert-type CTT-based scale (SAEM) developed by Demirtaşlı (2002) to measure attitudes towards educational measurement and evaluation. In this context, the polytomous IRT model that best fits the attitude data was investigated. For this purpose, the psychometric characteristics of the SAEM were tested under Samejima's Graded Response Model (GRM), the Partial Credit Model (PCM) and the Nominal Response Model (NRM) (Embretson & Reise, 2000).

Data Collection
The scale reconstructed in this study was developed to measure attitudes towards the measurement and evaluation course, which is compulsory in teacher education and teacher certification programs. It is a four-point Likert scale consisting of 41 items, and it was found to provide valid and reliable measures with three factors. Cronbach's alpha coefficients for the SAEM factors ranged from .82 to .92 (Demirtaşlı, 2002). The following four categories were used to respond to all items in the scale: 1=strongly disagree, 2=disagree, 3=agree and 4=strongly agree. Items that express negative attitudes were reverse scored: 4 for "strongly disagree" and 1 for "strongly agree". The minimum and maximum scores of the scale are 41 and 164, respectively. A higher score means that the participant has a more positive attitude towards the measurement and evaluation course, and a lower score indicates a negative attitude.
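The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the set of reverse-keyed item numbers is invented for the example, since the actual SAEM scoring key is not listed here.

```python
MIN_CAT, MAX_CAT = 1, 4  # 1 = strongly disagree ... 4 = strongly agree


def score_response(raw, negative_item):
    """Score one response; negatively worded items are reverse scored (1<->4, 2<->3)."""
    if not MIN_CAT <= raw <= MAX_CAT:
        raise ValueError(f"response {raw} is outside {MIN_CAT}-{MAX_CAT}")
    return (MIN_CAT + MAX_CAT - raw) if negative_item else raw


def total_score(responses, negative_items):
    """Sum the scored responses; `responses` maps item number -> raw category."""
    return sum(score_response(raw, item in negative_items)
               for item, raw in responses.items())


n_items = 41
negative_items = {2, 5, 9}  # ASSUMPTION: illustrative only, not the real SAEM key

# Most negative and most positive possible response patterns.
lowest = {i: (4 if i in negative_items else 1) for i in range(1, n_items + 1)}
highest = {i: (1 if i in negative_items else 4) for i in range(1, n_items + 1)}

print(total_score(lowest, negative_items), total_score(highest, negative_items))
```

Whatever the real key is, the two extreme patterns reproduce the stated score range of 41 to 164, because each of the 41 items contributes between 1 and 4 points after reverse scoring.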

Data Analysis
Data analysis was performed in three stages. First, the participants' responses to the items were scored, and the data were examined in terms of the basic IRT assumptions, namely unidimensionality and local independence. When the data fit the IRT-based models, invariant person and item parameters can be estimated (Embretson & Reise, 2000; Hambleton et al., 1991). This feature of IRT helps to construct tests with the desired features, to equate test forms and to develop computerized adaptive tests.
The dimensionality of the scale was examined by Principal Component Analysis (PCA) using SPSS 15.0. The suitability of the data for PCA was determined using the Kaiser-Meyer-Olkin (KMO) value and the results of Bartlett's test. The KMO value was found to be 0.97, and according to Bartlett's test, the chi-square statistic was significant (χ²(820) = 13163.31; p < 0.05). These findings indicate that the data were suitable for PCA. In the initial analysis, the 41 items loaded on five components; this five-component structure accounted for 60% of the total variance, and ten items loaded on more than one component (factor loading > 0.40). The scree plot (Figure 1) shows a rapid decrease in the eigenvalue from the first to the second factor. Based on this result, it was concluded that the SAEM had one dominant factor. The factor structure of the scale was then re-analyzed by restricting it to a single factor with varimax rotation. The results of the PCA restricted to a single factor showed that the 41 items explained 44% of the total variance and that factor loadings varied from 0.35 to 0.77. Based on these results, it can be concluded that the SAEM has a dominantly unidimensional structure and thus meets the unidimensionality assumption of IRT. Another assumption of IRT is local independence, which means that at a given trait level, a test taker's response to an item is independent of the responses to the other items. In other words, a response to any of the items in the scale (e.g. endorsing or rejecting a certain attitude) does not depend on the response to another item. This is observed when the unidimensionality assumption is met. In a test identified as unidimensional, the covariance between the items is zero for subjects at the same latent trait level. 
This indicates that once the unidimensionality assumption is met, the local independence assumption is also met (Hambleton & Swaminathan, 1985). As a result, the 41-item scale used in the current study was considered to have fulfilled the assumption of local independence.
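The share of variance carried by a dominant first component, which is the evidence for unidimensionality used above, can be illustrated with a small sketch. It is not the SPSS procedure used in the study: it applies plain power iteration to a toy 3×3 correlation matrix, and both the matrix and the function names are assumptions made for the example.

```python
import math


def first_eigenvalue(corr, iterations=500, tol=1e-12):
    """Largest eigenvalue of a symmetric correlation matrix via power iteration.

    Starts from a uniform vector, which works when items correlate positively.
    """
    n = len(corr)
    vec = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
        # Rayleigh quotient of the normalised vector.
        new_eigenvalue = sum(
            vec[i] * sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        eigenvalue = new_eigenvalue
        if converged:
            break
    return eigenvalue


def first_component_share(corr):
    """Proportion of total variance explained by the first principal component."""
    # The trace of a correlation matrix equals the number of items.
    return first_eigenvalue(corr) / len(corr)


# ASSUMPTION: toy matrix in which every pair of items correlates 0.5.
toy_corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
share = first_component_share(toy_corr)
print(round(share, 3))
```

For real data the correlation matrix would be computed from the 41 item responses, and the resulting share would be compared with the 44% reported for the single-factor solution.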

Journal of Measurement and Evaluation in Education and Psychology
In the second phase of the data analysis, the items were examined for bias. The probable source of bias in these data is gender. In a testing procedure, individual differences should result from the measured trait rather than from the gender of participants with the same latent trait level. To this end, the items in the scale were analyzed to determine whether they displayed differential item functioning (DIF) in terms of gender. To detect DIF in the polytomous items, the Polytomous Simultaneous Item Bias Test (PSIBTEST) and the IRT Likelihood Ratio Test (IRT-LRT) were used. In the PSIBTEST method, DIF is determined through a regression-based correction that helps to control Type I error (Clauser & Mazor, 1998).
IRT-LRT is based on a comparison of observed and theoretical models (Thissen, Steinberg & Wainer, 1993), which requires a restricted and an extended model. In the restricted model, which assumes that none of the items has DIF, the likelihood is calculated with the parameters of all items constrained to be equal across groups. In the extended model, the likelihood is calculated with the parameters of the item examined for DIF allowed to differ between the focal and reference groups, while the other parameters are held equal. The G² value is calculated as the difference between the −2 log likelihood values of the restricted and extended models (Thissen, 2001). The calculated G² value is then compared to the critical chi-square value with the corresponding degrees of freedom. The degrees of freedom equal the number of parameters estimated for the item, and thus in the current study it was four (three threshold parameters and one discrimination parameter). If the G² value is less than 9.49, a negligible level of DIF is considered to be present; if it is higher, there is a medium or high level of DIF against the focal group (Greer, 2004). The IRT-LRT analysis method uses anchor items to equate the groups. The following criteria are used for the selection of anchor items: having a high level of discrimination, having a wide range of difficulty levels, displaying no DIF according to other DIF detection methods, producing a small error variance in the PCA and having high factor loadings (Yıldırım, 2006). In this study, the criteria for the selection of anchor items were that they represented both directions of the attitude, displayed no DIF according to the results of PSIBTEST and had high factor loadings in the PCA. As a result, items 19, 27, 29 and 34 were selected as anchor items. DIF analyses were performed using the DIFPACK 1.7 and IRTLRDIF 2.0b packages. Table 1 presents the results of the analyses performed using the two DIF detection methods. As shown in Table 1, two items were found to display DIF in both methods (items 13 and 17). 
Item 13 was, "I would like to take other measurement courses" and item 17 was, "I wish I could take more measurement and evaluation courses". Both items were in favor of the male participants. In other words, when male and female students with the same level of attitude were compared, the probability of male students moving from "agree" to "strongly agree" was found to be significantly higher. Following the analysis performed, these items were excluded from the scale.
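The decision rule of the IRT-LRT procedure described above can be sketched as follows. This is a minimal illustration rather than the IRTLRDIF implementation: the function names are hypothetical, the two −2 log likelihood values are invented, and the tail probability uses a closed form that holds only for even degrees of freedom (which covers df = 4 here).

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def g2_statistic(neg2ll_restricted, neg2ll_extended):
    """G2 = (-2 log L of restricted model) - (-2 log L of extended model)."""
    return neg2ll_restricted - neg2ll_extended


DF = 4            # three thresholds + one discrimination parameter
CRITICAL = 9.49   # chi-square critical value for df = 4 at alpha = 0.05

# ASSUMPTION: invented -2 log likelihood values for a single studied item.
g2 = g2_statistic(1250.0, 1237.3)
p_value = chi2_upper_tail_even_df(g2, DF)
flagged = g2 > CRITICAL
print(round(g2, 1), flagged)
```

The cut-off of 9.49 quoted in the text corresponds to an upper-tail probability of about 0.05 at four degrees of freedom, which the helper reproduces.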
In the third stage of data analysis, the remaining 39 items were analyzed according to Samejima's GRM, PCM and NRM using the MULTILOG 7.03 package. Samejima's GRM is used for items that have ordered categorical responses, such as Likert-type scale items, and it is an extension of the two-parameter logistic (2PL) model. In GRM, the threshold values of response categories should be ordered, which is not required by PCM or generalized PCM (Embretson & Reise, 2000). PCM was originally developed for items that require responses in multiple steps. It is also used for the analysis of responses to items in scales that measure traits for which two or more categorical responses are possible (such as attitude and personality traits). NRM is used for items of a similar format, but it does not require the item choices to be ordered or identified numerically. The purpose of this model is to plot option characteristic curves based on the frequency of the selected choices in multiple-choice items. This model can also be applied to attitude and personality scales. All three models are used for items that are scored in graded categories, and they have different advantages and disadvantages. For example, Samejima's GRM does not require the items to have the same number of categories. Therefore, it is appropriate for scales consisting of items with different response formats. Furthermore, this model is an extension of the 2PL model and allows the discrimination index to differ among items. PCM, on the other hand, is an extension of the Rasch model, and as a result, the raw score is a sufficient statistic for estimating the trait level of an individual. However, in PCM the slopes of all items are considered to be equal. In other words, the model assumes that the discrimination index is equal across items, which is not easy to realize in practice (Baker, Rounds & Zevon, 2000; Embretson & Reise, 2000).
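To make the structure of the GRM concrete, the following sketch computes category probabilities for one four-category item. It assumes the usual logistic form without a scaling constant, and the slope and threshold values are illustrative rather than estimates from this study; the function name is hypothetical.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under Samejima's graded response model.

    `thresholds` must be ordered (b1 < b2 < ...), as the model requires.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("GRM thresholds must be in increasing order")
    # Cumulative probabilities P*(X >= k): 1 for the lowest category,
    # a 2PL-type curve at each threshold, and 0 beyond the highest category.
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# ASSUMPTION: illustrative parameters for a four-category Likert item.
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-2.0, -1.0, 0.0])
print([round(p, 3) for p in probs])
```

Because the categories are obtained as differences of ordered cumulative curves, the probabilities are non-negative and sum to one, which is why the model needs ordered thresholds.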

RESULTS
Thirty-nine items of the scale were scaled using the three models, and Table 2 presents the maximum item information obtained from each model. As shown in Table 2, according to GRM, PCM and NRM, item information ranges from 0.14 to 2.89, from 0.15 to 7.07 and from 0.13 to 3.10, respectively. The total test information values obtained from the three models are also presented in Table 2. The highest test information (67.56) was obtained from PCM at a theta (attitude) level of -1.20, followed by GRM and NRM, respectively. Although the reliability coefficients of the three models were close to each other, the highest reliability coefficient, 0.974, was obtained from PCM. The model-data fit level was determined by comparing the -2 log likelihood values of pairs of polytomous models. First, PCM and GRM were compared in terms of the difference in their -2 log likelihood values, the chi-square statistic and the degrees of freedom (df). The df is computed by multiplying the number of items by the number of parameters estimated per item in the model. The number of parameters varies according to the model used for estimation; however, the number of "step difficulty/threshold/intercept" parameters that replace item difficulty equals the number of categories minus one (Embretson & Reise, 2000). In PCM, for each item with four categories, three step difficulty parameters and two item slope parameters were calculated, and thus the degrees of freedom is 195 (39*5). In GRM, for each item with four categories, three threshold parameters and one item slope parameter were estimated, resulting in 156 degrees of freedom (39*4). Accordingly, χ²(195, 156) = 26115.8 - 25886.5 = 229.3, and the approximate table value is χ²(39; 0.05) = 55.75. Since the calculated value is higher than the table value, the difference between the -2 log likelihood values is significant. Therefore, it can be concluded that GRM is more appropriate for this type of data. 
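The PCM and GRM comparison reported above can be retraced with a short sketch. The -2 log likelihood values, degrees of freedom and approximate table value are taken from the text; the function name is hypothetical, and the critical value is supplied by hand rather than recomputed.

```python
def compare_models(neg2ll_a, df_a, neg2ll_b, df_b, critical_value):
    """Difference in -2 log likelihood between two models and its df.

    Returns (difference, df_difference, significant), where `significant`
    means the difference exceeds the supplied chi-square table value.
    """
    difference = neg2ll_a - neg2ll_b
    df_difference = abs(df_a - df_b)
    return difference, df_difference, difference > critical_value


N_ITEMS = 39
pcm_df = N_ITEMS * 5   # per item, as counted in the text for PCM
grm_df = N_ITEMS * 4   # three thresholds + one slope per item

diff, df_diff, significant = compare_models(
    neg2ll_a=26115.8, df_a=pcm_df,      # PCM
    neg2ll_b=25886.5, df_b=grm_df,      # GRM
    critical_value=55.75,               # approximate table value from the text
)
print(round(diff, 1), df_diff, significant)
```

The same helper can be reused for any other pair of models by passing their -2 log likelihood values, degrees of freedom and the matching table value.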
In the second stage, the difference in the -2 log likelihood values obtained from GRM and NRM was determined and compared with the chi-square statistic at the 0.05 significance level with the related degrees of freedom. In NRM, for each item with four categories, three intercept and three item slope parameters are computed, which results in 234 degrees of freedom (39*6). χ²(156, 234) = 25886.5 - 25832.4 = 54.1, and the approximate table value is χ²(78; 0.05) = 101.88. Since the calculated value is lower than the table value, the difference between the -2 log likelihood values is not significant. This indicates that there is no difference between the GRM and NRM models in terms of fit. Furthermore, in GRM, the reliability and maximum information values were found to be 0.973 and 49.84, respectively, while in NRM these were 0.970 and 45.97, respectively. Although no significant difference was observed between GRM and NRM in terms of data fit, estimations were performed using GRM, since the highest maximum information and reliability of the two were achieved with this model. Through the estimations in the MULTILOG program, the slope parameters, threshold parameters and threshold information functions of all the items were obtained and analyzed. According to GRM, six of the 39 items (6, 7, 10, 18, 31 and 41) had a low level of item information (under .45), and therefore they were excluded from the scale. Furthermore, differences between the observed and expected values were analyzed for each category, and the items were found to fit the data. Table 3 presents the slope and threshold parameters of the items estimated by the GRM model (33 items). As shown in Table 3, parameter 'a' is between 1.28 and 3.27. DeMars (2010) stated that the discrimination level of polytomous items is interpreted in the same way as that of dichotomous items. The discrimination level of items is classified as very low (0.01-0.34), low (0.35-0.64), medium (0.65-1.34), high (1.35-1.69) and very high (>1.70) (Baker, 2001). 
On this basis, 33 items in the current scale had a high or very high level of discrimination, with item 27 having the highest and item 6 having the lowest level.
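The classification bands quoted from Baker (2001) can be written as a small lookup. The function name is hypothetical, and the handling of values that fall between the printed bands (for example 0.345 or exactly 1.70) is an assumption: each band is taken to start at its printed lower bound.

```python
def classify_discrimination(a):
    """Label a slope parameter using the bands quoted from Baker (2001)."""
    if a <= 0:
        raise ValueError("the bands are defined for positive slopes only")
    if a < 0.35:
        return "very low"
    if a < 0.65:
        return "low"
    if a < 1.35:
        return "medium"
    if a < 1.70:
        return "high"
    return "very high"


# ASSUMPTION: illustrative slope values, not the Table 3 estimates.
for slope in (0.50, 1.50, 2.40):
    print(slope, classify_discrimination(slope))
```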

The threshold values of the categories vary from -2.83 to 2.62. Most of the first two threshold parameter values were found to be negative, which indicates that the first three categories were endorsed by participants with lower attitude levels (θ<0). The category threshold values in Table 3 show that the first threshold parameter is around -2, the second is around -1 and the third is around 0. This indicates that the scale better differentiates people with a lower attitude. Furthermore, the increase in the threshold parameters in parallel with the attitude level suggests that the categories perform hierarchically, as expected. Considering all these results, it is concluded that the discrimination and threshold values of the items in the scale are sufficiently high. Figure 2 presents the total information and the standard error of measurement obtained from the 33-item GRM-based SAEM. The scale provided less information only for individuals with an attitude level at the lowest or highest end. The maximum information obtained from the GRM-based scale was found to be 48.23, which was achieved at the attitude level of -1.40.
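The link between information and measurement error shown in Figure 2 can be sketched as follows, using the standard relation SE(θ) = 1/√I(θ). The item information function is the usual GRM form; the function names and the two sets of item parameters are assumptions for illustration, and only the value 48.23 comes from the text.

```python
import math


def grm_item_information(theta, a, thresholds):
    """GRM item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of the category probability with respect to theta.
        slope_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            info += slope_k ** 2 / p_k
    return info


def standard_error(information):
    """Standard error of the trait estimate implied by test information."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


# ASSUMPTION: two hypothetical items as (slope, thresholds).
hypothetical_items = [(2.0, [-2.0, -1.0, 0.0]), (1.5, [-2.5, -1.2, 0.3])]
test_info = sum(grm_item_information(-1.0, a, b) for a, b in hypothetical_items)

# Standard error implied by the reported maximum information of 48.23.
se_at_peak = standard_error(48.23)
print(round(se_at_peak, 3))
```

Test information is the sum of the item information functions, so the drop in information at the extremes of the attitude scale translates directly into larger standard errors there.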

CONCLUSION and DISCUSSION
In the related literature, there are many examples of cognitive test construction studies under IRT models. On the other hand, relatively few scales on affective traits have been developed with IRT-based models. In this study, the SAEM, an attitude scale, was rescaled based on polytomous IRT models. Attitude is an important affective trait that influences our behaviour (Ajzen, 2005). Identifying and describing the negative attitudes of pre-service teachers towards a course with a valid and reliable attitude scale helps to determine the pedagogic efforts and activities needed to change negative attitudes into positive ones. This contributes to the establishment of a positive learning environment. Unlike CTT-based scales, tests and scales based on IRT models separately estimate the probability of individuals at different trait (attitude) levels endorsing each category of each item, which provides more valid and reliable results in terms of the measurement of individual differences. Thus, the functioning of both items and response categories can be estimated independently of the other items in the scale and of the participant samples. In other words, from the total information and item information values obtained with this approach, it is possible to identify the items and response categories that reveal individual differences at different attitude levels (Le, 2013; Matteucci & Stracqualursi, 2006). In the current study, the model-data fit was higher in GRM than in PCM. This may have resulted from the GRM requirement that the threshold values of response categories should be ordered. Thus, the statements "strongly disagree", "disagree", "agree" and "strongly agree" can be considered to indicate ordered responses. This study shows that the 33-item IRT-based SAEM is a valid and reliable instrument for determining the attitudes of pre-service teachers towards the measurement and evaluation course.
Future studies can be carried out with this instrument. This IRT-based version of the scale is a valid and reliable scale that can be used in studies that compare the attitudes of teacher candidates from teacher education and other college programs. In addition, this scale can be used to investigate the effectiveness of a variety of actions undertaken during the teaching of the course to change any negative attitudes held by pre-service teachers. Furthermore, this scale can be used to investigate the relationship between pre-service teachers' achievement in the measurement and evaluation course and their attitude towards it.