Simultaneous Estimation of Overall Score and Subscores Using MIRT, HO-IRT and Bi-factor Model on TIMSS Data

In educational testing, there is increasing interest in the simultaneous estimation of overall scores and subscores. This study compares the reliability and precision of simultaneously estimated overall scores and subscores obtained with the MIRT, HO-IRT, and Bi-factor models. Responses to the TIMSS 2015 mathematics test were used as the data set. The test form consists of 35 items, four of which are polytomously scored (0-1-2), while the rest are dichotomously scored (0-1). The four content domains are number (14 items), algebra (9 items), geometry (6 items), and data and chance (6 items). Ability parameters were estimated using the BMIRT software. The results showed that the MIRT and HO-IRT methods performed similarly in terms of precision and reliability for subscore estimates. The MIRT maximum information method had the smallest standard error of measurement for the overall score estimates. All three methods performed similarly in terms of overall score reliability. The findings suggest that, among the three methods compared, HO-IRT appears to be the better choice for the simultaneous estimation of the overall score and subscores for the TIMSS 2015 data. Recommendations for testing practice and future research are provided.


INTRODUCTION
Many educational and psychological tests measure more than one ability, which makes them inherently multidimensional (Reckase, 1985; 1997). Tests may be inherently multidimensional due to their intended content or construct structure (Ackerman, Gierl, & Walker, 2003). Tests consisting of different content domains often measure a primary ability and additional abilities; thus, each item measures the primary ability and one additional secondary ability. Content categories can be considered the source of secondary abilities. That is, while the primary ability is the estimated overall score, subscores for content categories are considered secondary abilities (DeMars, 2005). Subscores estimated from secondary abilities have recently attracted substantial attention (DeMars, 2005; Reckase & Xu, 2015; Sinharay, Haberman, & Wainer, 2011; Wedman & Lyren, 2015). This interest stems from the potential diagnostic value of subscores for future remedial work, in which students have a chance to learn their strengths and weaknesses in the content domains that the test measures (Haberman & Sinharay, 2010). Haberman (2008) and Sinharay (2010) focused on the added value of subscores over the total score by using Classical Test Theory methods. Brennan (2012) suggested a utility index similar to Haberman's method. In addition, the subscore augmentation method developed by Wainer, Sheehan, and Wang (2000) is used to examine whether borrowing information from other portions of the test (augmented subscore) estimates the subscore more accurately.
In multidimensional tests, the reported overall score shows test-takers' achievement levels on the overall construct that the test measures. Subscores, on the other hand, give additional information about the strengths and weaknesses of test-takers in the domain abilities, while the overall score presents a general profile. For example, the TOEFL, an English-language test, has four content domains (reading, listening, speaking, and writing). For this test, test-takers receive four subscores, one for each skill, and a total score representing general English-language ability. Since many tests have a multidimensional structure, interest in estimating and reporting overall scores and subscores simultaneously has increased (Liu & Liu, 2017). Simultaneous estimation of those scores provides test-takers and educators with more detailed information about students' primary and secondary ability levels (Yao, 2010). More specifically, as opposed to estimating the primary and secondary abilities separately, simultaneous estimation yields information on all of those abilities from a single analysis.
Several studies discuss methods for estimating the overall score and subscores simultaneously (de la Torre & Song, 2009; de la Torre & Song, 2010; Liu, Li, & Liu, 2018; Soysal & Kelecioğlu, 2018; Yao, 2010). All of these studies emphasize that the reliability of scores is very important when overall scores and subscores need to be reported. Yao (2010) states that simple averaging of the domain scores is the most commonly used method to obtain the overall score. She also indicates that simply averaging the domain scores ignores (a) different maximum raw score points of different domains, (b) correlation between the domain abilities, and (c) the possibility of having a different relationship between overall scores and domain scores at different score points. In order to overcome these problems, Yao (2010) proposed using the Multidimensional Item Response Theory (MIRT) maximum information method for the overall score instead of the simple averaging method. The proposed method does not assume any linear relationship between the overall score and domain scores. In that study, subscores were estimated by using MIRT, and the overall scores were estimated by using the MIRT maximum information method. The estimated overall scores and subscores were compared to those obtained from the Higher-Order Item Response Theory (HO-IRT), Bi-factor, and unidimensional IRT methods. It was found that the MIRT method provides reliable subscores, similar to those of the HO-IRT method, as well as a reliable overall score. The MIRT maximum information method produced overall scores with the smallest standard error of measurement (Yao, 2010).
de la Torre and Song (2009) also proposed using Higher-order Item Response Theory approach for simultaneous estimation of overall and domain abilities. The HO-IRT method assumes a linear relationship between the overall score and the domain score, unlike the MIRT method. In the study, the HO-IRT method was compared with the unidimensional IRT (UIRT) in which the overall ability is estimated using all items ignoring the multidimensional structure of the data, and the domain abilities are estimated using corresponding subsets of items, separately. The findings of the study show that the overall and domain abilities can be estimated more efficiently by using the HO-IRT method. Additionally, in the HO-IRT framework, it is possible to obtain efficient overall and domain ability estimates with small sample sizes and small number of items (de la Torre & Song, 2010).
To estimate the overall score and domain scores based on the bi-factor model, Liu et al. (2018) introduced six methods in the framework of the bi-factor model and compared them with the MIRT method. The weights of the general and domain factors were calculated in different ways in those six bi-factor methods. The most accurate and reliable overall and domain scores in most conditions were obtained using the Bi-factor-M4 and Bi-factor-M6 methods, the weights of which were computed using discrimination parameters for a specific domain. In the bi-factor methods, the domain-specific factors are orthogonal to the general factor and to each other, unlike in the MIRT and HO-IRT methods.
Research on the simultaneous estimation of the overall score and subscores is limited (de la Torre & Song, 2010; Liu et al., 2018; Soysal & Kelecioğlu, 2018; Yao, 2010). The present study aims to contribute to this literature. Its purpose is to investigate which method of simultaneous estimation yields more accurate and reliable estimates of the overall score and subscores. For this purpose, the MIRT, HO-IRT, and bi-factor general models, the methods most often suggested in the literature, were used. This study also differs from earlier research in that it analyzes mixed-format data, including both dichotomously and polytomously scored items, whereas the other studies used data consisting only of dichotomously scored or only of polytomously scored items. Using mixed-format data is important because tests containing a mixture of multiple-choice and constructed-response items are used in many testing situations (Lane, 2005; Yao & Schwarz, 2006).

Multidimensional Item Response Theory
Multidimensional Item Response Theory is a method that provides "a reasonably accurate representation of the relationship between persons' locations in a multidimensional space and the probabilities of their responses to a test item" (Reckase, 2009, p. 53) with a particular mathematical expression. An essential distinction between MIRT models related to the structure of the data is whether the probability of responses to any test item is influenced by one latent dimension or not. If this is the case, the structure of the data is defined as between-item dimensionality (simple-structure). If responses to one item are affected by more than one ability, then, it is denoted as within-item dimensionality (complex structure; Adams, Wilson, & Wang, 1997). In this study, the data were assumed to follow a simple structure because each item was modeled as depending on one specific ability dimension.
Additionally, there are several models within MIRT that differ mainly in the number of possible score points for the items: MIRT models for dichotomously scored items and MIRT models for polytomously scored items. All of the MIRT models can be considered generalizations of unidimensional IRT models (Reckase, 1997). However, many tests contain both dichotomously and polytomously scored items on the same test form, which creates a need to use different item response models together (Yao & Schwarz, 2006). The TIMSS mathematics achievement test also contains mixed item types. Therefore, in the present study, the TIMSS data were examined using the multidimensional three-parameter logistic (M-3PL) model for dichotomously scored items and the multidimensional two-parameter partial credit model (M-2PPC) for polytomously scored items, as suggested by Yao and Schwarz (2006). For a dichotomous item j, the probability of a correct response to item j for an examinee with ability $\vec{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iD})$ under the M-3PL model (Reckase, 1997) is

$$P_{ij1} = P(x_{ij} = 1 \mid \vec{\theta}_i, \vec{\beta}_j) = \beta_{3j} + \frac{1 - \beta_{3j}}{1 + e^{-\vec{\beta}_{2j} \odot \vec{\theta}_i + \beta_{1j}}} \tag{1}$$

where
$x_{ij}$ = the response of examinee i to item j,
$\vec{\beta}_j$ = the parameters for the j-th item $(\vec{\beta}_{2j}, \beta_{1j}, \beta_{3j})$,
$\vec{\beta}_{2j}$ = a vector of dimension D of item discrimination parameters $(\beta_{2j1}, \ldots, \beta_{2jD})$,
$\beta_{1j}$ = the scale difficulty parameter,
$\beta_{3j}$ = the scale guessing parameter, and
$\vec{\beta}_{2j} \odot \vec{\theta}_i$ = the dot product of the two vectors.
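As an illustration of the M-3PL response function, the following minimal Python sketch evaluates the probability of a correct response as the guessing parameter plus (1 minus the guessing parameter) times a logistic function of the discrimination-ability dot product minus the difficulty. The function name, argument names, and all numeric values are hypothetical and chosen only for illustration; they are not estimates from the TIMSS data or part of BMIRT.

```python
import math


def m3pl_prob(theta, a, b, c):
    """Probability of a correct response under the M-3PL model.

    theta: ability vector (one value per dimension)
    a:     discrimination vector (same length as theta)
    b:     scalar difficulty parameter
    c:     scalar guessing parameter (lower asymptote)
    """
    dot = sum(a_d * t_d for a_d, t_d in zip(a, theta))
    return c + (1.0 - c) / (1.0 + math.exp(-dot + b))


# Hypothetical simple-structure item: it loads only on the first of four domains.
theta = [0.5, -0.2, 0.0, 1.0]
p = m3pl_prob(theta, a=[1.2, 0.0, 0.0, 0.0], b=0.3, c=0.2)
print(round(p, 3))
```

Because the item's discrimination vector is zero on all but one dimension, only the first ability affects the probability, which is what the simple-structure assumption implies.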

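For a polytomous item j, the probability of a response k−1 to item j for an examinee with ability $\vec{\theta}_i$ under the M-2PPC model (Yao & Schwarz, 2006) is

$$P_{ijk} = P(x_{ij} = k-1 \mid \vec{\theta}_i, \vec{\beta}_j) = \frac{\exp\left((k-1)\,\vec{\beta}_{2j} \odot \vec{\theta}_i - \sum_{t=1}^{k} \beta_{\delta_t j}\right)}{\sum_{m=1}^{K_j} \exp\left((m-1)\,\vec{\beta}_{2j} \odot \vec{\theta}_i - \sum_{t=1}^{m} \beta_{\delta_t j}\right)} \tag{2}$$

where
$x_{ij}$ = the response of examinee i to item j $(0, \ldots, K_j - 1)$,
$\vec{\beta}_j$ = the parameters for the j-th item $(\vec{\beta}_{2j}, \beta_{\delta_2 j}, \ldots, \beta_{\delta_{K_j} j})$,
$\vec{\beta}_{2j}$ = a vector of dimension D of item discrimination parameters $(\beta_{2j1}, \ldots, \beta_{2jD})$,
$\beta_{\delta_k j}$ = the threshold parameters for $k = 1, 2, \ldots, K_j$, with $\beta_{\delta_1 j} = 0$, and
$K_j$ = the number of response categories for the j-th item.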
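The category probabilities of the M-2PPC model can be sketched in the same way. The Python function below is a minimal illustration, assuming the usual partial credit form in which the numerator for score k−1 is the exponential of (k−1) times the discrimination-ability dot product minus the cumulative sum of thresholds, with the first threshold fixed at zero. The function name and the example parameter values are hypothetical.

```python
import math


def m2ppc_probs(theta, a, thresholds):
    """Category probabilities for scores 0..K-1 under the M-2PPC model.

    theta:      ability vector
    a:          discrimination vector (same length as theta)
    thresholds: K threshold parameters; thresholds[0] is 0 by convention
    """
    dot = sum(a_d * t_d for a_d, t_d in zip(a, theta))
    numerators = []
    cumulative = 0.0
    for k, delta in enumerate(thresholds, start=1):
        cumulative += delta  # running sum of thresholds up to category k
        numerators.append(math.exp((k - 1) * dot - cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical 0-1-2 item loading on the first of four domains.
probs = m2ppc_probs([0.5, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, -0.5, 0.4])
print([round(p, 3) for p in probs])
```

The returned list has one probability per score category and sums to one, so it can be used directly to compute an expected item score.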

Higher-Order Item Response Theory
de la Torre and Song (2009) proposed a higher-order multidimensional IRT approach in which overall and domain abilities can be specified simultaneously. In this model, the first order describes domain-specific abilities, while the second order can be viewed as the overall ability. Each domain is considered unidimensional; the second-order ability contains all the domain abilities, so the overall ability is also viewed as unidimensional. de la Torre and Hong (2010) stated that a test is deemed multi-unidimensional in the HO-IRT framework.
The HO-IRT method uses a hierarchical Bayesian framework (de la Torre et al., 2011), and the domain abilities are considered linear functions of the overall ability, expressed as

$$\theta_i^{(d)} = \lambda^{(d)} \theta_i + \varepsilon_{id} \tag{3}$$

where
$\theta_i$ = the overall ability,
$\theta_i^{(d)}$ = the domain-specific abilities, d = 1, 2, …, D,
$\lambda^{(d)}$ = the latent coefficient in regressing ability d on the overall ability,
$\varepsilon_{id}$ = the error term following a normal distribution with a mean of zero and variance of $1 - \lambda^{(d)2}$, and
$|\lambda^{(d)}| \le 1$.
The latent regression coefficient, $\lambda^{(d)}$, is also the correlation between the overall and domain abilities. Mathematically, $\lambda^{(d)}$ can take negative values, but it is generally expected to be positive since domain abilities are typically related to the overall ability.
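The claim that the latent regression coefficient equals the correlation between the overall and domain abilities can be checked with a small simulation. The sketch below is illustrative only: it draws a standard normal overall ability, generates domain abilities as the coefficient times the overall ability plus normal error with variance one minus the squared coefficient, and then compares the sample correlation with the coefficient. The coefficient values, sample size, and function names are hypothetical.

```python
import math
import random


def simulate_domain_abilities(overall, lambdas, rng):
    """Draw one domain ability per coefficient: lambda * overall + error."""
    return [
        lam * overall + rng.gauss(0.0, math.sqrt(1.0 - lam ** 2))
        for lam in lambdas
    ]


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
lambdas = [0.9, 0.8, 0.85, 0.7]  # hypothetical values for four domains
overall = [rng.gauss(0.0, 1.0) for _ in range(20000)]
domains = [simulate_domain_abilities(t, lambdas, rng) for t in overall]

for d, lam in enumerate(lambdas):
    r = pearson(overall, [row[d] for row in domains])
    print(f"domain {d + 1}: lambda = {lam}, sample correlation = {r:.3f}")
```

Because the error variance is one minus the squared coefficient, each simulated domain ability also has unit variance, which keeps the overall and domain abilities on the same scale.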
When the focus is on estimating the abilities of test-takers (Equation 3), the model parameters that need to be estimated are the overall ability, the domain abilities, and the latent regression parameters $\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}$. These parameters are estimated within the hierarchical Bayesian formulation given by de la Torre and Song (2009).

Bi-factor General Model
The bi-factor model (Gibbons & Hedeker, 1992) defines a general factor on which all the items load and domain-specific factors on which the items related to that dimension load. The domain-specific factors are orthogonal to the general factor. The method provides estimates of the overall ability and domain abilities at the same time. Within the Bi-factor framework, the domain factors are considered nuisance traits, which yields a more meaningful overall ability (DeMars, 2013; Yao, 2010). Cai, Yang, and Hansen (2011) demonstrated the factor pattern of the standard item bi-factor measurement structure as

$$\begin{pmatrix} a_{10} & a_{11} & 0 \\ a_{20} & a_{21} & 0 \\ a_{30} & a_{31} & 0 \\ a_{40} & 0 & a_{42} \\ a_{50} & 0 & a_{52} \\ a_{60} & 0 & a_{62} \end{pmatrix}$$
As seen in the pattern, there are six items, one general factor, and two domain-specific factors. The a's are the item discrimination parameters, which are similar to factor loadings. The first column refers to the general factor, and the last two columns refer to the domain factors (Cai et al., 2011).
As defined in Liu et al.'s (2018) study, in the vector of item discrimination parameters, only the one for the general factor ($a_g$) and the one for the s-th subscale ($a_s$) have values other than zero. The ability vector of each examinee includes one overall ability for the general factor ($\theta_g$) and domain-specific abilities for the S specific factors ($\theta_1, \ldots, \theta_s, \ldots, \theta_S$).
Based on the Bi-factor model, estimation of the overall score and domain scores can be expressed as follows:

$$\theta_{\text{overall}} = w_1 \theta_g + v_1 \sum_{s=1}^{S} \theta_s \tag{7}$$

and

$$\theta_{\text{domain}_s} = w_2 \theta_g + v_2 \theta_s \tag{8}$$

where
$w_1$ = the weight of the general factor for the overall score,
$v_1$ = the weight of the domain factors for the overall score,
$w_2$ = the weight of the general factor for the domain scores, and
$v_2$ = the weight of the domain factors for the domain scores.
Thus, the overall score (Equation 7) is a weighted composite of the general factor ($\theta_g$) and all domain factors ($\theta_1, \ldots, \theta_s, \ldots, \theta_S$), while the domain score (Equation 8) for the s-th factor is a weighted composite of the general factor ($\theta_g$) and the relevant domain-specific factor ($\theta_s$). In the current study, the Bi-factor general model was employed by using 1 and 0 as the weights, as in the study of Yao (2010): $w_1 = 1$, $v_1 = 0$ and $w_2 = 0$, $v_2 = 1$. In this method, the general factor represents the overall score, while the domain-specific factors represent subscores.
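The weighted composites described above can be illustrated with a short Python sketch. It assumes, for illustration only, that a single weight applies to every domain factor in the overall score; the function name, the weight names, and the ability values are hypothetical. With the 1 and 0 weights used in the current study, the overall score reduces to the general factor and each subscore reduces to its own domain-specific factor.

```python
def bifactor_scores(theta_g, theta_s, w1=1.0, v1=0.0, w2=0.0, v2=1.0):
    """Weighted composite scores from bi-factor ability estimates.

    theta_g: estimate on the general factor
    theta_s: list of estimates on the domain-specific factors
    w1, v1:  weights of the general and domain factors for the overall score
    w2, v2:  weights of the general and domain factors for the domain scores
    Assumption: one common weight v1 for all domain factors.
    """
    overall = w1 * theta_g + v1 * sum(theta_s)
    domains = [w2 * theta_g + v2 * t for t in theta_s]
    return overall, domains


# Hypothetical estimates for one examinee: general factor plus four domains.
overall, domains = bifactor_scores(0.4, [0.1, -0.3, 0.2, 0.0])
print(overall, domains)
```

Other weighting schemes, such as those compared by Liu et al. (2018), would correspond to different values of the four weights.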

Data Description
Eighth graders' responses to the mathematics test in the Trends in International Mathematics and Science Study (TIMSS) 2015 were used in this study. Each country's data from the first booklet of the mathematics achievement test were merged into a single data set. The first booklet was chosen because it has the largest number of polytomously scored items (four items). Missing data were handled by listwise deletion because the researchers aimed to analyze data from examinees who answered all of the items. The final data set consists of 5732 students from all the countries who were administered the first assessment booklet in TIMSS 2015. Table 1 shows the distribution of scoring types and content domains for the test form chosen for the current study. As shown in Table 1, the test has four content domains, which are number (14 items), algebra (9 items), geometry (6 items), and data and chance (6 items). The total number of items is 35, four of which are polytomously scored (0-1-2), while the rest are dichotomously scored (0-1).
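Listwise deletion, as used above, keeps only the examinees with a valid response to every item. A minimal Python sketch is given below; the function name, the use of None as the missing-response code, and the toy response matrix are assumptions made for illustration and do not reflect the actual TIMSS file layout or missing-value codes.

```python
def listwise_delete(rows, missing=None):
    """Keep only response vectors that contain no missing value."""
    return [row for row in rows if all(value != missing for value in row)]


# Toy response matrix: three examinees, four items, None marks a missing response.
responses = [
    [1, 0, 2, 1],
    [1, None, 0, 1],
    [0, 0, 1, 0],
]
complete = listwise_delete(responses)
print(len(complete), "of", len(responses), "examinees retained")
```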

Dimensionality analysis
Establishing the dimensional structure of the data is essential for obtaining validity evidence that supports the interpretation and use of scores (Reckase & Xu, 2015). Dimensionality shows the relationship between a test and response patterns, which gives clues about the latent structure measured by the test. Wainer and Thissen (1996) mention the fixed and random forms of dimensionality. While random dimensionality refers to the possibility of encountering some "unexpected" dimensions, fixed dimensionality is a somewhat "expected" situation. In particular, it is usual to see multidimensionality in scores when the test has multiple content domains, so the data can be assumed to have a multidimensional structure in that case. Under this circumstance, it might be more reasonable and effective to use confirmatory dimensionality assessment (Zhang, 2016). Therefore, confirmatory methods were used to assess the dimensionality structure of the data in this study: Confirmatory Factor Analysis (CFA) and the content-based confirmatory mode of Poly-DETECT (Zhang & Stout, 1999a, 1999b; Zhang, 2007).
The Poly-DETECT analysis was conducted with the sirt package (Robitzsch, 2018). The analysis yields the DETECT, ASSI, and RATIO indices. The criteria for evaluating these indices are presented in Table 2 (Jang & Roussos, 2007; Zhang, 2007).

Estimating overall score and subscores
Three estimation methods (MIRT, HO-IRT, and Bi-factor) were used to obtain the overall score (mathematics achievement) and subscores (number, algebra, geometry, and data and chance) for the 5732 test takers who were administered the first booklet of TIMSS 2015. Ability parameters for the methods were estimated using the BMIRT software (Yao, 2003; Yao, 2013; Yao, Lewis, & Zhang, 2008). In the present study, the data were analyzed using the M-3PL model for dichotomously scored items and the M-2PPC model for polytomously scored items for all of the estimation methods. The following are brief explanations of the estimation methods and what they estimate in the context of the current data:
-MIRT: The simple-structure MIRT analysis was used to estimate abilities based on the four content domains. It gives four thetas (θ), each of which represents a single subscore. The overall score was obtained from the domain scores using the maximum information method, as in Yao (2010).
-HO-IRT: It is assumed that there is a linear relationship between the overall score and subscores, so the parameters for the overall ability and domain abilities were estimated simultaneously.
-Bi-factor: The Bi-factor general model estimated five abilities. The first one was the general dimension, and the other four abilities were content-specific dimensions, respectively. In the bifactor model, content-specific dimensions are orthogonal to each other and the general dimension, and there is no correlation between dimensions.
The default priors of the BMIRT software were used for the analyses in this study. The mean and variance of the ability prior distribution were 0.0 and 1.0, respectively. The priors for the discrimination parameters were lognormal with a mean of 1.5 and a variance of 1.5. For the difficulty or threshold parameters, a normal distribution with a mean of 0.0 and a variance of 1.5 was used. The guessing parameter c had a beta(α, β) prior distribution with α = 100 and β = 400.
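To make the guessing prior more concrete, the following Python sketch computes the mean and standard deviation implied by a beta distribution with α = 100 and β = 400, using the standard closed-form moments of the beta distribution. The function name is hypothetical; the calculation only describes the prior itself and says nothing about the posterior estimates obtained in this study.

```python
import math


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


mean, sd = beta_mean_sd(100, 400)
print(f"prior mean = {mean:.3f}, prior sd = {sd:.4f}")
```

The prior is centered at 0.2 with a standard deviation of roughly 0.018, so it concentrates the guessing parameter tightly around 0.2.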

Evaluation criteria
The conditional standard error of measurement (cSEM) was used to evaluate the accuracy of overall scores and subscores. The BMIRT program calculated the cSEM values for each student's ability parameters under studied methods estimating the overall and domain scores simultaneously. Then, the analysis of variance (ANOVA) on repeated-measures data for the cSEM was conducted to examine whether there is a significant difference among the mean errors calculated by estimation methods.
The other criterion for the evaluation of the methods is reliability. The Bayesian marginal reliability proposed by de la Torre and Patz (2005), also called empirical reliability (Brown & Croudace, 2015), was applied in this study. The reliability of test d can be obtained from

$$\rho_d = \frac{\sigma^2_{\hat{\theta}}}{\sigma^2_{\hat{\theta}} + \bar{\sigma}^2_{SE}} \tag{9}$$

The observed (Equation 10) and marginal posterior (Equation 11) variances of the overall or domain ability estimates are computed from the estimated ability scores $\hat{\theta}$ and their standard errors (SE) in a sample of N test takers:

$$\sigma^2_{\hat{\theta}} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{\theta}_i - \bar{\hat{\theta}}\right)^2 \tag{10}$$

$$\bar{\sigma}^2_{SE} = \frac{1}{N} \sum_{i=1}^{N} SE^2(\hat{\theta}_i) \tag{11}$$

For this study, reliability measures for one overall score and four subscores were obtained from the equations above for each of the studied methods. Higher marginal reliability indicates more reliable scores from the method tested (Md Desa, 2012).
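The reliability computation can be sketched in a few lines of Python. The sketch assumes the common form of empirical reliability for Bayesian ability estimates, namely the observed variance of the estimates divided by the sum of that variance and the mean squared standard error, with both variances computed using a divisor of N. The function name and the toy values are hypothetical and are not results from this study.

```python
def empirical_reliability(estimates, standard_errors):
    """Empirical (Bayesian marginal) reliability from estimates and their SEs.

    Assumed form: observed variance / (observed variance + mean squared SE).
    """
    n = len(estimates)
    mean = sum(estimates) / n
    observed_var = sum((t - mean) ** 2 for t in estimates) / n
    posterior_var = sum(se ** 2 for se in standard_errors) / n
    return observed_var / (observed_var + posterior_var)


# Toy example: four ability estimates with their standard errors.
theta_hat = [-1.2, -0.3, 0.4, 1.1]
se = [0.40, 0.35, 0.35, 0.42]
print(round(empirical_reliability(theta_hat, se), 3))
```

Under this form, smaller standard errors relative to the spread of the estimates give reliability values closer to one.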

Dimensionality Analysis
Poly-DETECT (confirmatory mode) and Confirmatory Factor Analysis were conducted to examine the multidimensionality due to the content domains in the mixed-format TIMSS data used in this study. Table 3 shows the results of the content-based Poly-DETECT analysis. As seen in Table 3, the ASSI (0.459) and RATIO (0.522) indices indicated an essential deviation from unidimensionality, and the DETECT index (0.406) indicated moderate multidimensionality. Together, the values of the indices obtained from the Poly-DETECT analysis provide evidence of multidimensionality for the current data.
A four-factor model was tested through CFA. The content domains with related items were taken as factors, and the model fit was evaluated. Fit indices for the data and the associated criteria are presented in Table 4.

CFI and TLI indicated that the model fits the data well (≥ 0.95). Likewise, the RMSEA value (≤ 0.05) showed a good fit (Table 4). According to the results of CFA, the four-factor model had a good fit with the present data, which supported content-based multidimensionality. After providing evidence of the content-based multidimensionality of the data, the overall and domain abilities were obtained with the aforementioned methods.

Precision of Estimates
The three selected methods (MIRT, HO-IRT, and Bi-factor) were run in the BMIRT program to estimate the overall score and subscores simultaneously. BMIRT also provided standard errors for the estimated scores. The means and standard deviations of the standard errors for the overall and domain ability estimates under each estimation method are summarized in Table 5. Generally, MIRT and HO-IRT yielded similar results, but the HO-IRT estimation method performed slightly better than MIRT for the domain abilities. The Bi-factor model gave the largest standard errors for the domain abilities among all the methods, and standard errors similar to those of MIRT for the overall ability. The results of the repeated-measures ANOVA testing whether the differences between standard errors are statistically significant are presented in Table 6. The HO-IRT method had the lowest standard errors for all domain abilities, and MIRT had the second-lowest standard errors. Domain abilities from the Bi-factor model were not as accurate as those from the other two methods.
Therefore, it can be concluded that HO-IRT elicited a statistically significant reduction in the standard errors of domain ability estimates. Likewise, the overall ability results showed that the standard errors were significantly affected by the type of estimation method (F(1.692, 9696.490) = 8162.767, p < .05, partial η² = .588). Post hoc tests using the Bonferroni correction revealed that all pairwise comparisons were significantly different from each other. The HO-IRT had the highest mean standard error. The MIRT and Bi-factor models had low and similar standard errors for the overall ability. In general, the three estimation methods were significantly different for all the abilities, including the overall and domain abilities.
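The reported effect size for the overall ability can be checked against the reported F statistic and degrees of freedom, since partial eta squared equals F times the effect degrees of freedom, divided by that product plus the error degrees of freedom. The short Python sketch below performs this check; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    ss_ratio = f_value * df_effect
    return ss_ratio / (ss_ratio + df_error)


eta = partial_eta_squared(8162.767, 1.692, 9696.490)
print(round(eta, 3))  # 0.588
```

The recovered value agrees with the reported partial η² of .588. The ratio of the error to the effect degrees of freedom is also approximately 5731, that is, the number of test takers minus one, as expected for a one-way repeated-measures design with corrected degrees of freedom.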

Reliability of Scores
The overall and four domain ability estimates from the studied methods were compared in terms of marginal reliability. Table 7 presents the Bayesian marginal reliability of the overall score and of the subscores for the four content domains. In general, MIRT and HO-IRT had substantially higher reliability than the Bi-factor model across all content domains. The reliability of the Bi-factor model was extremely low for the domain scores, especially for geometry (0.253) and data and chance (0.161). In addition, reliability varied slightly between domains for MIRT and HO-IRT. The reliability coefficients of the HO-IRT subscores were 0.894 for number, 0.838 for algebra, 0.824 for geometry, and 0.809 for data and chance. It can be concluded that HO-IRT was the most reliable method of estimating subscores, followed by MIRT, for all content domains for the data used in the current study. Furthermore, the reliabilities of all methods decreased as the number of items in the domains decreased. The reliability of the overall score was 0.816 for MIRT, 0.815 for HO-IRT, and 0.876 for Bi-factor. Unlike for the subscores, the Bi-factor model was the most reliable method for the overall score estimation. The other two methods (MIRT and HO-IRT) also estimated the overall score with high reliability.

DISCUSSION and CONCLUSION
When the overall and domain abilities are reported to the test takers and used by the authorities, it is important to obtain accurate and reliable estimates of the overall score and subscores. The overall scores are useful in reporting the test-takers' general achievement and in making important decisions such as rank-ordering the test takers. On the other hand, the subscores provide test takers, teachers, or policymakers with more diagnostic information, such as strengths and weaknesses in each domain. The simultaneous estimation of those scores can address both of these needs.
This study examined three methods of estimating the overall score and subscores simultaneously in the same model, including MIRT, HO-IRT, and Bi-factor, and compared the reliability and precision of these methods across the overall and domain ability estimates. For this purpose, the real data of mixed item types from TIMSS 2015 were used. The results of Poly-DETECT and CFA provided evidence for the content-based multidimensional structure of the data. The study showed that the MIRT and HO-IRT methods performed similarly in terms of precision and reliability for subscore estimates. However, HO-IRT had slightly lower standard errors and higher reliability than MIRT. Likewise, de la Torre and Song (2009) stated that domain ability estimates can be more efficient by using the HO-IRT model. In addition, Yao (2010) found that MIRT and HO-IRT were quite similar in terms of estimating subscores. The precise ability estimation and reliable scores by using HO-IRT also supported the use of subscores for reporting for the current data. The Bi-factor general model had the highest standard errors and lowest reliability estimates for the domain scores. Liu et al. (2018) also did not recommend the Bi-factor, the original factor method, for reporting scores. They proposed six other methods of reporting overall and subscores as weighted composite scores of the overall and domain-specific factors in a bi-factor model.
For the overall ability estimation, the MIRT maximum information method and the Bi-factor model outperformed the HO-IRT method with regard to standard errors. The MIRT maximum information method had the smallest standard error of measurement for the overall score estimates, as in the study of Yao (2010). While all three methods performed similarly and relatively well in terms of overall score reliability, the reliability of the Bi-factor model was slightly higher than that of the other two methods.
The analyses of the current study suggested that overall, HO-IRT seems the best solution for the simultaneous estimation of the overall and subscores for the data from TIMSS 2015. Soysal and Kelecioğlu (2018) also recommended the use of HO-IRT in estimation of overall and subscores in their study.
In the present study, only real data were used to examine the relative performance of the three methods, since the true model for the data was not known. Therefore, it is quite possible to get different results for other samples. It is suggested that future research can be done by using other real data. It is also advisable that when the simultaneous estimation of the overall and domain abilities must be done in testing practices, the relative performance of the estimation methods should be checked before reporting the scores to test takers.