Abstract
In this chapter, we describe item response theory (IRT) equating methods under various designs. Topics include scaling person and item parameters, IRT true and observed score equating methods, equating using item pools, and equating using polytomous IRT models.
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Kolen, M.J., Brennan, R.L. (2014). Item Response Theory Methods. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_6
DOI: https://doi.org/10.1007/978-1-4939-0317-7_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-0316-0
Online ISBN: 978-1-4939-0317-7
eBook Packages: Mathematics and Statistics (R0)