Abstract
Numerous choices exist for designing and implementing a multistage test (MST) across dozens of heterogeneous educational systems internationally. In this chapter, we review recent research on MST in the international large-scale assessment (ILSA) context. We first describe the inherent heterogeneity of ILSA populations and the associated measurement challenges, and we explain how MST offers a means of tailoring assessments to better measure the full achievement distribution while minimizing test burden. We then emphasize design choices and how these impact item and person parameter estimates as well as item exposure rates. We also discuss the tension between fully realizing the promise of an MST design and the primacy of stable trend estimates. Specifically, we discuss design choices with respect to the structure of the MST and its panels, routing decisions within the MST, routing methods, module lengths, and position effects.
References
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. https://doi.org/10.1007/BF02293801
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444. https://doi.org/10.1177/014662168200600405
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
Chen, H., Yamamoto, K., & von Davier, M. (2014). Controlling multistage testing exposure rates in international large-scale assessments. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 391–409). CRC Press.
Eggen, T., & Verhelst, N. (2011). Item calibration in incomplete testing designs. Psicologica: International Journal of Methodology and Experimental Psychology, 32(1), 107–132.
ETS. (2016). PISA 2018 integrated design. OECD. https://www.oecd.org/pisa/pisaproducts/PISA-2018-INTEGRATED-DESIGN.pdf
Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of Educational and Behavioral Statistics, 13(1), 45–52. https://doi.org/10.3102/10769986013001045
Kamens, D. H., & McNeely, C. L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative Education Review, 54(1), 5–25. https://doi.org/10.1086/648578
Kim, H., & Plake, B. S. (1993, April). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Atlanta, GA. https://eric.ed.gov/?id=ED357041
Kirsch, I., & Lennon, M. L. (2017). PIAAC: A new design for a new era. Large-Scale Assessments in Education, 5(1). https://doi.org/10.1186/s40536-017-0046-6
Lord, F. M. (1965). Item sampling in test theory and in research design. Educational Testing Service. http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=AD0619069
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35(3), 229–249. https://doi.org/10.1111/j.1745-3984.1998.tb00537.x
Magis, D., Yan, D., & von Davier, A. A. (2018). mstR: Procedures to generate patterns under multistage testing (1.2) [Computer software]. https://CRAN.R-project.org/package=mstR
Martin, M. O., Mullis, I. V. S., & Foy, P. (2013). TIMSS 2015 assessment design. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2015 assessment frameworks. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016). Methods and procedures in TIMSS 2015. TIMSS & PIRLS International Study Center, Boston College. http://timssandpirls.bc.edu/publications/timss/2015-methods.html
Mullis, I. V. S., & Martin, M. O. (Eds.). (2017). TIMSS 2019 assessment framework. TIMSS & PIRLS International Study Center. https://timssandpirls.bc.edu/timss2015/frameworks.html
Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS 2015 international results in mathematics. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. http://timssandpirls.bc.edu/timss2015/international-results/timss-2015/mathematics/student-achievement/
OECD. (2010). PISA computer-based assessment of student skills in science. OECD Publishing. http://www.oecd.org/education/school/programmeforinternationalstudentassessmentpisa/pisacomputer-basedassessmentofstudentskillsinscience.htm
OECD. (2013). Technical report of the survey of adult skills (PIAAC). OECD Publishing. https://doi.org/10.1787/9789264204027-en
OECD. (2014). PISA 2012 technical report. OECD Publishing. https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf
OECD. (2016). PISA 2015 results: Excellence and equity in education (Vol. I). OECD Publishing.
OECD. (2017a). PISA 2015 assessment and analytical framework. OECD Publishing. https://doi.org/10.1787/9789264281820-en
OECD. (2017b). PISA 2015 technical report. OECD Publishing. http://www.oecd.org/pisa/data/2015-technical-report/
R Core Team. (2020). R: A language and environment for statistical computing (4.0.0) [Computer software]. R Foundation for Statistical Computing. http://www.R-project.org/
Robitzsch, A., Kiefer, T., & Wu, M. (2019). TAM: Test analysis modules (3.1-45) [Computer software]. https://CRAN.R-project.org/package=TAM
Rutkowski, L., Gonzalez, E., von Davier, M., & Zhou, Y. (2014). Assessment design for international large-scale assessment. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 75–95). Chapman & Hall/CRC Press.
Rutkowski, D., Rutkowski, L., & Liaw, Y.-L. (2018). Measuring widening proficiency differences in international assessments: Are current approaches enough? Educational Measurement: Issues and Practice, 37(4), 40–48. https://doi.org/10.1111/emip.12225
Rutkowski, L., Rutkowski, D., & Liaw, Y.-L. (2019). The existence and impact of floor effects for low-performing PISA participants. Assessment in Education: Principles, Policy & Practice, 26(6), 643–664. https://doi.org/10.1080/0969594X.2019.1577219
Rutkowski, L., Liaw, Y.-L., Svetina, D., & Rutkowski, D. (2020). Multistage testing in heterogeneous populations: Some design and implementation considerations. Manuscript submitted for publication.
Sinharay, S. (2018). On the choice of anchor tests in equating. Educational Measurement: Issues and Practice, 37(2), 64–69. https://doi.org/10.1111/emip.12175
Svetina, D., Liaw, Y.-L., Rutkowski, L., & Rutkowski, D. (2019). Routing strategies and optimizing design for multistage testing in international large-scale assessments. Journal of Educational Measurement, 56(1), 192–213. https://doi.org/10.1111/jedm.12206
TIMSS & PIRLS International Study Center. (n.d.). PIRLS 2021 Group adaptive design – Assessment frameworks. Retrieved September 27, 2021, from http://pirls2021.org/frameworks/home/assessment-design-framework/group-adaptive-design/
Trierweiler, T. J., Lewis, C., & Smith, R. L. (2016). Further study of the choice of anchor tests in equating. Journal of Educational Measurement, 53(4), 498–518. https://doi.org/10.1111/jedm.12128
van de Vijver, F. J. R., & Matsumoto, D. (2011). Introduction to the methodological issues associated with cross-cultural research. In D. Matsumoto & F. J. R. van de Vijver (Eds.), Cross-cultural research methods in psychology. Cambridge University Press.
Yamamoto, K., Khorramdel, L., & Shin, H. J. (2018a). Introducing multistage adaptive testing into international large-scale assessments designs using the example of PIAAC. Psychological Test and Assessment Modeling, 60(3), 347–368.
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2018b). Multistage adaptive testing design in international large-scale assessments. Educational Measurement: Issues and Practice, 37(4), 16–27. https://doi.org/10.1111/emip.12226
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2019). Introduction of multistage adaptive testing design in PISA 2018. OECD Publishing.
Yan, D., Lewis, C., & von Davier, A. A. (2014). Overview of computerized multistage tests. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 3–20). CRC Press.
Acknowledgments
This project was partially funded by a grant from the Norwegian Research Council, FINNUT program (grant no. 255246).
Appendix A: Some Technical Details for SLRR19 and RLSR20
RLSR20
Data generation and estimation were conducted in R (R Core Team, 2020). We used mstR 1.2 (Magis et al., 2018) for the MST simulation and TAM 3.1-45 (Robitzsch et al., 2019) for item calibration and population modeling. (The mstR routing function was custom modified by Magis to allow for a probabilistic routing element.) One hundred replications were performed within each condition. We elaborate on our simulation and analysis subsequently. We calibrated items using a 2PL model with the following technical specifications: we used marginal maximum likelihood estimation obtained via the EM algorithm (Bock & Aitkin, 1981); we estimated a multigroup model, assuming a different Gaussian distribution for each of the nine populations; we did not specify prior item probabilities or starting values; we used quasi-Monte Carlo integration with 1,000 integration points; and our convergence criterion was set to 0.0001. For model identification, the latent mean and variance of the first population were fixed to 0 and 1, respectively. Population achievement distributions were estimated using latent regression; to identify that model, we likewise assumed a mean and variance of 0 and 1, respectively, for one population. Because of this identification restriction, the item and person parameters were on a scale determined by the selection of this population. To place the item parameters back onto the generating scale, we used a mean/sigma linking approach.
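To make the last step concrete, below is a minimal sketch of mean/sigma linking in base R. It is illustrative rather than a reproduction of our code: the vectors a_est, b_est, and b_gen are hypothetical stand-ins for the estimated 2PL discriminations and difficulties and for the generating difficulties.

```r
# Mean/sigma linking: place estimated 2PL parameters back onto the
# generating scale. a_est/b_est are estimated discriminations and
# difficulties; b_gen are the generating difficulties (hypothetical names).
mean_sigma_link <- function(a_est, b_est, b_gen) {
  A <- sd(b_gen) / sd(b_est)           # slope of the linear transformation
  B <- mean(b_gen) - A * mean(b_est)   # intercept
  list(a = a_est / A,                  # rescaled discriminations
       b = A * b_est + B,              # rescaled difficulties
       constants = c(A = A, B = B))
}

# Illustrative use with simulated parameters
set.seed(1)
b_gen  <- rnorm(20)                    # generating difficulties
b_est  <- 0.9 * b_gen + 0.2            # estimates on a shifted/rescaled metric
a_est  <- rlnorm(20, 0, 0.3)
linked <- mean_sigma_link(a_est, b_est, b_gen)
linked$constants
```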
SLRR19
We conducted a Monte Carlo simulation study to address our research questions, using the R (R Core Team, 2020) package mstR 1.2 (Magis et al., 2018) for the MST simulation and analyses and the mirt package (Chalmers, 2012) for item parameter calibration. (The mstR routing function was custom modified by Magis to allow for a probabilistic routing element.) We calibrated items using a 2PL model with the following technical specifications: we used marginal maximum likelihood estimation obtained via the EM algorithm (Bock & Aitkin, 1981) with a maximum of 500 cycles; we assumed a Gaussian distribution for θ; we did not specify prior item probabilities or starting values; we used 61 quadrature points; and our model convergence criterion was set to 0.0001. We used expected a posteriori (EAP) estimation for the person parameters (Bock & Mislevy, 1982). For item calibration, we assumed a latent variable mean and variance of 0 and 1, respectively, for model identification.
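As a minimal sketch of this calibration (not our production code), the following uses mirt with arguments matching the specifications above: at most 500 EM cycles, 61 quadrature points, a convergence tolerance of 0.0001, and EAP person estimates. The simulated response matrix is purely illustrative.

```r
library(mirt)

# Simulate an illustrative 2PL response matrix: 2,000 examinees, 30 items
set.seed(123)
a     <- rlnorm(30, 0, 0.3)                        # discriminations
b     <- rnorm(30)                                 # difficulties
theta <- rnorm(2000)                               # N(0, 1) abilities
p     <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
resp  <- (matrix(runif(2000 * 30), 2000, 30) < p) * 1

# Unidimensional 2PL via MML/EM with the reported specifications
mod <- mirt(resp, model = 1, itemtype = "2PL",
            quadpts = 61,                          # 61 quadrature points
            TOL = 1e-4,                            # convergence criterion
            technical = list(NCYCLES = 500))       # up to 500 EM cycles

coef(mod, IRTpars = TRUE, simplify = TRUE)$items   # a/b parameterization
eap <- fscores(mod, method = "EAP")                # EAP ability estimates
```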
Copyright information
© 2022 Springer Nature Switzerland AG
About this entry
Cite this entry
Rutkowski, L., Rutkowski, D., & Svetina Valdivia, D. (2022). Multistage test design considerations in international large-scale assessments of educational achievement. In T. Nilsen, A. Stancel-Piątak, & J.-E. Gustafsson (Eds.), International handbook of comparative large-scale studies in education (Springer International Handbooks of Education). Springer, Cham. https://doi.org/10.1007/978-3-030-88178-8_63
DOI: https://doi.org/10.1007/978-3-030-88178-8_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88177-1
Online ISBN: 978-3-030-88178-8