Abstract
Numerous choices exist for designing and implementing a multistage test (MST) across dozens of heterogeneous educational systems internationally. In this chapter, we review recent research on MST in the international large-scale assessment (ILSA) context. We first describe the inherent heterogeneity of ILSA populations and the associated measurement challenges, and we explain how MST offers a means of tailoring assessments to better measure the full achievement distribution while minimizing test burden. We then emphasize design choices and how these impact item and person parameter estimates as well as item exposure rates. We also discuss the tension between fully realizing the promise of an MST design and the primacy of stable trend estimates. Specifically, we discuss design choices with respect to the structure of the MST and its panels, routing decisions within the MST, routing methods, module lengths, and position effects.
References
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. https://doi.org/10.1007/BF02293801
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444. https://doi.org/10.1177/014662168200600405
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
Chen, H., Yamamoto, K., & von Davier, M. (2014). Controlling multistage testing exposure rates in international large-scale assessments. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 391–409). CRC Press.
Eggen, T., & Verhelst, N. (2011). Item calibration in incomplete testing designs. Psicologica: International Journal of Methodology and Experimental Psychology, 32(1), 107–132.
ETS. (2016). PISA 2018 integrated design. OECD. https://www.oecd.org/pisa/pisaproducts/PISA-2018-INTEGRATED-DESIGN.pdf
Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of Educational and Behavioral Statistics, 13(1), 45–52. https://doi.org/10.3102/10769986013001045
Kamens, D. H., & McNeely, C. L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative Education Review, 54(1), 5–25. https://doi.org/10.1086/648578
Kim, H., & Plake, B. S. (1993, April). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Atlanta, GA. https://eric.ed.gov/?id=ED357041
Kirsch, I., & Lennon, M. L. (2017). PIAAC: A new design for a new era. Large-Scale Assessments in Education, 5(1). https://doi.org/10.1186/s40536-017-0046-6
Lord, F. M. (1965). Item sampling in test theory and in research design. Educational Testing Service. http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=AD0619069
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35(3), 229–249. https://doi.org/10.1111/j.1745-3984.1998.tb00537.x
Magis, D., Yan, D., & von Davier, A. A. (2018). mstR: Procedures to generate patterns under multistage testing (1.2) [Computer software]. https://CRAN.R-project.org/package=mstR
Martin, M. O., Mullis, I. V. S., & Foy, P. (2013). TIMSS 2015 assessment design. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2015 assessment frameworks. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016). Methods and procedures in TIMSS 2015. TIMSS & PIRLS International Study Center, Boston College. http://timssandpirls.bc.edu/publications/timss/2015-methods.html
Mullis, I. V. S., & Martin, M. O. (Eds.). (2017). TIMSS 2019 assessment framework. TIMSS & PIRLS International Study Center. https://timssandpirls.bc.edu/timss2015/frameworks.html
Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS 2015 international results in mathematics. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. http://timssandpirls.bc.edu/timss2015/international-results/timss-2015/mathematics/student-achievement/
OECD. (2010). PISA computer-based assessment of student skills in science. OECD Publishing. http://www.oecd.org/education/school/programmeforinternationalstudentassessmentpisa/pisacomputer-basedassessmentofstudentskillsinscience.htm
OECD. (2013). Technical report of the survey of adult skills (PIAAC). OECD Publishing. https://doi.org/10.1787/9789264204027-en
OECD. (2014). PISA 2012 technical report. OECD Publishing. https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf
OECD. (2016). PISA 2015 results: Excellence and equity in education (Vol. I). OECD Publishing.
OECD. (2017a). PISA 2015 assessment and analytical framework. OECD Publishing. https://doi.org/10.1787/9789264281820-en
OECD. (2017b). PISA 2015 technical report. OECD Publishing. http://www.oecd.org/pisa/data/2015-technical-report/
R Core Team. (2020). R: A language and environment for statistical computing (4.0.0) [Computer software]. R Foundation for Statistical Computing. http://www.R-project.org/
Robitzsch, A., Kiefer, T., & Wu, M. (2019). TAM: Test analysis modules (3.1-45) [Computer software]. https://CRAN.R-project.org/package=TAM
Rutkowski, L., Gonzalez, E., von Davier, M., & Zhou, Y. (2014). Assessment design for international large-scale assessment. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 75–95). Chapman & Hall/CRC Press.
Rutkowski, D., Rutkowski, L., & Liaw, Y.-L. (2018). Measuring widening proficiency differences in international assessments: Are current approaches enough? Educational Measurement: Issues and Practice, 37(4), 40–48. https://doi.org/10.1111/emip.12225
Rutkowski, L., Rutkowski, D., & Liaw, Y.-L. (2019). The existence and impact of floor effects for low-performing PISA participants. Assessment in Education: Principles, Policy & Practice, 26(6), 643–664. https://doi.org/10.1080/0969594X.2019.1577219
Rutkowski, L., Liaw, Y.-L., Svetina, D., & Rutkowski, D. (2020). Multistage testing in heterogeneous populations: Some design and implementation considerations. Manuscript submitted for publication.
Sinharay, S. (2018). On the choice of anchor tests in equating. Educational Measurement: Issues and Practice, 37(2), 64–69. https://doi.org/10.1111/emip.12175
Svetina, D., Liaw, Y.-L., Rutkowski, L., & Rutkowski, D. (2019). Routing strategies and optimizing design for multistage testing in international large-scale assessments. Journal of Educational Measurement, 56(1), 192–213. https://doi.org/10.1111/jedm.12206
TIMSS & PIRLS International Study Center. (n.d.). PIRLS 2021 Group adaptive design – Assessment frameworks. Retrieved September 27, 2021, from http://pirls2021.org/frameworks/home/assessment-design-framework/group-adaptive-design/
Trierweiler, T. J., Lewis, C., & Smith, R. L. (2016). Further study of the choice of anchor tests in equating. Journal of Educational Measurement, 53(4), 498–518. https://doi.org/10.1111/jedm.12128
van de Vijver, F. J. R., & Matsumoto, D. (2011). Introduction to the methodological issues associated with cross-cultural research. In D. Matsumoto & F. J. R. van de Vijver (Eds.), Cross-cultural research methods in psychology. Cambridge University Press.
Yamamoto, K., Khorramdel, L., & Shin, H. J. (2018a). Introducing multistage adaptive testing into international large-scale assessments designs using the example of PIAAC. Psychological Test and Assessment Modeling, 60(3), 347–368.
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2018b). Multistage adaptive testing design in international large-scale assessments. Educational Measurement: Issues and Practice, 37(4), 16–27. https://doi.org/10.1111/emip.12226
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2019). Introduction of multistage adaptive testing design in PISA 2018. OECD Publishing.
Yan, D., Lewis, C., & von Davier, A. A. (2014). Overview of computerized multistage tests. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 3–20). CRC Press.
Acknowledgments
This project was partially funded by a grant from the Norwegian Research Council, FINNUT program (grant no. 255246).
Appendix A: Some Technical Details for SLRR19 and RLSR20
RLSR20
Data generation and estimation were conducted in R (R Core Team, 2020). We used mstR 1.2 (Magis et al., 2018) for the MST simulation and TAM 3.1-45 (Robitzsch et al., 2019) for item calibration and population modeling. (The mstR routing function was custom modified by Magis to allow for a probabilistic routing element.) One hundred replications were performed within each condition. We elaborate on our simulation and analysis subsequently. We calibrated items using a 2PL model with the following technical specifications: we used marginal maximum likelihood estimation obtained via the EM algorithm (Bock & Aitkin, 1981); we estimated a multigroup model, assuming a different Gaussian distribution for each of the nine populations; we did not specify prior item probabilities or starting values; we used quasi-Monte Carlo integration with 1,000 integration points; and our convergence criterion was set to 0.0001. For model identification, the latent mean and variance of the first population were fixed to 0 and 1, respectively. Population achievement distributions were estimated using latent regression; to identify that model, we likewise assumed a mean and variance of 0 and 1, respectively, for one population. Because of this identification restriction, the item and person parameters were on a scale determined by the selection of this population. To place the item parameters back onto the generating scale, we used a mean/sigma linking approach.
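To make the last step concrete, below is a minimal sketch of mean/sigma linking in base R. It is illustrative rather than a reproduction of our code: the vectors a_est, b_est, and b_gen are hypothetical stand-ins for the estimated 2PL discriminations and difficulties and for the generating difficulties.

```r
# Mean/sigma linking: place estimated 2PL parameters back onto the
# generating scale. a_est/b_est are estimated discriminations and
# difficulties; b_gen are the generating difficulties (hypothetical names).
mean_sigma_link <- function(a_est, b_est, b_gen) {
  A <- sd(b_gen) / sd(b_est)           # slope of the linear transformation
  B <- mean(b_gen) - A * mean(b_est)   # intercept
  list(a = a_est / A,                  # rescaled discriminations
       b = A * b_est + B,              # rescaled difficulties
       constants = c(A = A, B = B))
}

# Illustrative use with simulated parameters
set.seed(1)
b_gen  <- rnorm(20)                    # generating difficulties
b_est  <- 0.9 * b_gen + 0.2            # estimates on a shifted/rescaled metric
a_est  <- rlnorm(20, 0, 0.3)
linked <- mean_sigma_link(a_est, b_est, b_gen)
linked$constants
```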
SLRR19
We conducted a Monte Carlo simulation study to address our research questions, using the R (R Core Team, 2020) package mstR 1.2 (Magis et al., 2018) for the MST simulation and analyses and the mirt package (Chalmers, 2012) for item parameter calibration. (The mstR routing function was custom modified by Magis to allow for a probabilistic routing element.) We calibrated items using a 2PL model with the following technical specifications: we used marginal maximum likelihood estimation obtained via the EM algorithm (Bock & Aitkin, 1981) with a maximum of 500 cycles; we assumed a Gaussian distribution for θ; we did not specify prior item probabilities or starting values; we used 61 quadrature points; and our model convergence criterion was set to 0.0001. We used expected a posteriori (EAP) estimation for the person parameters (Bock & Mislevy, 1982). For item calibration, we assumed a latent variable mean and variance of 0 and 1, respectively, for model identification.
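As a minimal sketch of this calibration (not our production code), the following uses mirt with arguments matching the specifications above: at most 500 EM cycles, 61 quadrature points, a convergence tolerance of 0.0001, and EAP person estimates. The simulated response matrix is purely illustrative.

```r
library(mirt)

# Simulate an illustrative 2PL response matrix: 2,000 examinees, 30 items
set.seed(123)
a     <- rlnorm(30, 0, 0.3)                        # discriminations
b     <- rnorm(30)                                 # difficulties
theta <- rnorm(2000)                               # N(0, 1) abilities
p     <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
resp  <- (matrix(runif(2000 * 30), 2000, 30) < p) * 1

# Unidimensional 2PL via MML/EM with the reported specifications
mod <- mirt(resp, model = 1, itemtype = "2PL",
            quadpts = 61,                          # 61 quadrature points
            TOL = 1e-4,                            # convergence criterion
            technical = list(NCYCLES = 500))       # up to 500 EM cycles

coef(mod, IRTpars = TRUE, simplify = TRUE)$items   # a/b parameterization
eap <- fscores(mod, method = "EAP")                # EAP ability estimates
```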
Copyright information
© 2022 Springer Nature Switzerland AG
About this entry
Cite this entry
Rutkowski, L., Rutkowski, D., & Svetina Valdivia, D. (2022). Multistage test design considerations in international large-scale assessments of educational achievement. In T. Nilsen, A. Stancel-Piątak, & J.-E. Gustafsson (Eds.), International handbook of comparative large-scale studies in education (Springer International Handbooks of Education). Springer, Cham. https://doi.org/10.1007/978-3-030-88178-8_63
DOI: https://doi.org/10.1007/978-3-030-88178-8_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88177-1
Online ISBN: 978-3-030-88178-8