Item Response Theory Methods

Test Equating, Scaling, and Linking

Abstract

In this chapter, we describe item response theory (IRT) equating methods under various designs. Topics include scaling person and item parameters, IRT true score and observed score equating methods, equating using item pools, and equating using polytomous IRT models.

Author information

Corresponding author

Correspondence to Michael J. Kolen.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Kolen, M.J., Brennan, R.L. (2014). Item Response Theory Methods. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_6
