Software-basierte Evaluation freier Antwortformate

Härtig, Hendrik

doi:10.1007/s40573-014-0012-6

Original Paper
Published: 21 August 2014

Zeitschrift für Didaktik der Naturwissenschaften, Volume 20, pages 115–128 (2014)

Hendrik Härtig¹

Summary

To date, larger research projects have favored paper-and-pencil instruments with closed-response item formats for reasons of economy and scoring objectivity. Evidence suggests that at least supplementing such tests with open-ended response formats benefits their validity, but this is offset by the immense effort required to score the answers reliably. Software packages now exist, however, that can use semantic analyses to process and categorize longer written answers. Such methods could make it possible to use open-ended response formats with large numbers of participants and to evaluate them by software. For English-language projects on content knowledge, first encouraging findings for large numbers of participants are already available. Here, software-based evaluation is trialed on a very small sample for eight items from a German-language instrument assessing pedagogical content knowledge in physics. On the one hand, good to very good agreement between the software and human experts can be achieved. On the other hand, options for optimization also become apparent: above all the number of participants, but also the item design, offers room for improvement.

Abstract

Because of the considerable work and cost involved in scoring, the majority of large-scale assessments use closed-response item formats. However, it can be shown that open-ended questions could at least increase the validity of such assessments. Within the last few years, software has been developed that can process and categorize written answers based on latent semantic analyses. So far, a few studies have evaluated whether such software packages can be used for a software-based evaluation of open-ended test items. First results for large groups of English-speaking students in an assessment of conceptual understanding are encouraging. In the study presented here, the applicability of software-based evaluation of open-ended test items was tested in a small-scale assessment of physics pedagogical content knowledge (PCK). On the one hand, good to perfect agreement between the software and human experts was found for half of the eight items. On the other hand, we identified two possibilities for increasing the agreement: the sample may have been too small for some of the items, and the structure of the items may influence the results significantly.
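The abstract reports agreement between the software and human experts but does not say which coefficient was used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two raters. The function name `cohens_kappa` and the scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of answers")
    n = len(rater_a)
    # Observed agreement: share of answers placed in the same category.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical scores (1 = correct, 0 = incorrect) for ten answers to one item.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
software = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(human, software)
```

For these made-up scores the two raters agree on 8 of 10 answers, the chance agreement is 0.52, and kappa is about 0.58. With very small samples, a single differing answer shifts the coefficient noticeably, which is consistent with the abstract's remark that the sample size may limit the agreement that can be reached.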

Notes

It should be noted that these are items from the pilot study. For the main study, the items were revised and selected on the basis of the pilot study and an expert survey. In the current version, for example, a nuclear power plant is no longer mentioned.
This example was deliberately kept this simple in order to make the underlying principles as accessible as possible. For the items in the online supplement, all data are also provided for the software-based evaluation (SBE), so considerably more complex examples are available there.

Acknowledgements

I would like to thank two anonymous reviewers for their helpful feedback on the manuscript. I also thank the Deutsche Forschungsgemeinschaft for funding this project, and the applicants of the project „Messung professioneller Kompetenzen in mathematischen und naturwissenschaftlichen Lehramtsstudiengängen“ for their kind cooperation.

Author information

Authors and Affiliations

IPN – Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik, Kiel, Germany
Hendrik Härtig

Corresponding author

Correspondence to Hendrik Härtig.

Electronic supplementary material

(PDF 266 kb)

About this article

Cite this article

Härtig, H. Software-basierte Evaluation freier Antwortformate. ZfDN 20, 115–128 (2014). https://doi.org/10.1007/s40573-014-0012-6

Received: 16 June 2014
Accepted: 26 June 2014
Published: 21 August 2014
Issue Date: November 2014
DOI: https://doi.org/10.1007/s40573-014-0012-6
