Item Response Theory Methods

Kolen, Michael J.; Brennan, Robert L.

doi:10.1007/978-1-4939-0317-7_6

Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)

Abstract

This chapter describes item response theory (IRT) equating methods under various designs. Topics include scaling person and item parameters, IRT true score and observed score equating, equating using item pools, and equating with polytomous IRT models.

Author information

Authors and Affiliations

Iowa Testing Programs, University of Iowa, Iowa City, IA, USA
Michael J. Kolen
CASMA, University of Iowa, Iowa City, IA, USA
Robert L. Brennan

Corresponding author

Correspondence to Michael J. Kolen.

About this chapter

Cite this chapter

Kolen, M.J., Brennan, R.L. (2014). Item Response Theory Methods. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_6

DOI: https://doi.org/10.1007/978-1-4939-0317-7_6
Published: 14 January 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-0316-0
Online ISBN: 978-1-4939-0317-7
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)