Content-specificity of teachers' judgment accuracy regarding students' academic achievement

Teachers’ accuracy in judging students’ achievement is often assumed to be a general ability of teachers. Based on this assumption, teachers should be at least consistent in their accuracy across different content domains within a school subject. Yet, this assumption has rarely been investigated empirically so far. Data from 54 mathematics teachers (N = 1170 students) and 55 language teachers (N = 1255 students) were analysed using a Bayesian multivariate multilevel modelling approach. Results indicate that latent accuracy measures across content domains indeed are substantially correlated within both investigated subjects, but may still be considered to represent different dimensions.
© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Judging students’ academic abilities is an important task of teachers, which drives their daily decision-making. Teachers’ judgments influence, for example, their lesson planning and the selection and difficulty level of learning activities and materials, and they serve as a basis for adaptive interactions with their students (Alvidrez & Weinstein, 1999; Herppich et al., 2018; Loibl, Leuders, & Dörfler, 2020). The ability of teachers to make accurate judgments, also referred to as teachers’ judgment accuracy, is therefore considered a necessary condition for meaningful teaching activities, especially in terms of optimally tailoring teaching to students’ strengths and needs (Begeny, Eckert, Montarello, & Storie, 2008; Hoge & Coladarci, 1989; Pielmeier, Huber, & Seidel, 2018). Adaptive teaching behaviour is in turn related to positive student academic outcomes (Brühwiler & Blatchford, 2011; Corno, 2008; Parsons et al., 2018). Therefore, it is assumed that a high level of teachers’ judgment accuracy has a positive impact on teaching effectiveness.
* Corresponding author. E-mail addresses: dimitra.kolovou@phsg.ch (D. Kolovou), Naumanna@dipf.de (A. Naumann), jan.hochweber@phsg.ch (J. Hochweber), anna.praetorius@ife.uzh.ch (A.-K. Praetorius).

Previous studies on teachers' judgment accuracy of student achievement have typically focused on one single academic domain.¹ That is, studies have investigated either judgments of overall achievement in one subject (i.e., subject domain) or in a single content area within a subject (i.e., content domain). Researchers have commonly assumed that judgment accuracy is a general ability of the teacher (Artelt & Rausch, 2014; Hurwitz, Elliott, & Braden, 2007; Schrader, 2010) and can therefore be generalised across (content) domains. Based on this assumption, studies have often used single measures of accuracy in one specific content domain (e.g., arithmetic) to examine how accurate teachers are in judging students' ability in a subject as a whole (e.g., mathematics; Gabriele, Joram, & Park, 2016; Lorenz & Artelt, 2009). Yet, there is a distinct lack of studies that investigate whether it is possible to infer from teachers' judgment accuracy in one content domain (e.g., arithmetic) that teachers will also judge accurately in other content domains (e.g., geometry) of a specific subject (for an exception see Lorenz & Artelt, 2009).
The question of content-generality versus content-specificity is relevant for understanding the structure of judgment accuracy, that is, whether judgment accuracy can be mapped to different content domains and therefore is content-general or whether it consists of distinguishable content-specific facets (Herppich et al., 2018). Clarifying the structure of teachers' judgment accuracy is also important with respect to its measurement. If judgment accuracy is content-general, a single measure of accuracy suffices for gaining insight into teachers' judgment accuracy across an array of content domains. Otherwise, multiple content-specific measures are necessary. A further implication concerns the ways in which the development of judgment accuracy can be fostered, that is, whether content-specific or content-general training is more effective.
To pursue this issue in a systematic manner, the present study investigates whether teachers' judgment accuracy concerning students' academic achievement is specific to different content domains within the two different subjects, mathematics and German language class. To examine judgment accuracy in different content domains and the relations among them simultaneously, we applied an innovative multivariate multilevel latent modelling approach. This approach mitigates typical methodological limitations of previous studies of teachers' judgment accuracy (see Challenges in Measuring Teachers' Judgment Accuracy section).
In the following sections, we first elaborate on teachers' judgment accuracy regarding students' academic achievement and summarise previous findings. Afterwards, we consider the content-specificity of teachers' judgment accuracy from a theoretical perspective and report related empirical results in prior studies. Subsequently, our methodological approach is described. Finally, our research questions and hypotheses are presented.

Accuracy of teachers' judgments regarding students' academic achievement
Teacher judgments are defined as accurate when they are consistent with objective assessments of students' academic achievement (e.g., test scores; Hoge & Coladarci, 1989; Kaufmann, 2020; Ready & Wright, 2011). Student achievement is usually measured either by standardised tests or by curriculum-based measurement procedures (CBM; see Eckert, Dunn, Codding, Begeny, & Kleinmann, 2006; Feinberg & Shapiro, 2003). A commonly used measure of teachers' judgment accuracy is based on computing correlations between teachers' judgments of their individual students' achievement and the students' test performance for each single classroom or teacher, respectively, and averaging after applying Fisher's z-transformation (e.g., Helmke & Schrader, 1987; Südkamp et al., 2012; Urhahne & Wijnia, 2021).
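The correlation-based accuracy measure described above can be illustrated with a minimal sketch. It assumes hypothetical judgment and test-score lists for three classrooms; the function names `pearson_r` and `mean_accuracy` and all data values are illustrative assumptions, not taken from any of the cited studies.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_accuracy(classrooms):
    """Average per-classroom correlations via Fisher's z-transformation.

    `classrooms` is a list of (judgments, test_scores) pairs, one per teacher.
    Note: a perfect correlation (|r| = 1) cannot be z-transformed.
    """
    zs = [math.atanh(pearson_r(j, t)) for j, t in classrooms]  # r -> z
    return math.tanh(sum(zs) / len(zs))  # mean z -> back to r

# Hypothetical data for three classrooms (invented for illustration).
classrooms = [
    ([3, 5, 6, 8, 9], [40, 55, 50, 70, 85]),
    ([2, 4, 4, 7, 9], [35, 30, 60, 65, 80]),
    ([1, 5, 6, 6, 10], [20, 45, 70, 50, 90]),
]
print(round(mean_accuracy(classrooms), 2))
```

Note that this simple average treats every classroom equally, regardless of how many students were judged.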

Content-specificity of teachers' judgment accuracy
Teachers' judgment accuracy of students' achievement is often implicitly conceptualised as content-general (Herppich et al., 2018). At the same time, some research evidence suggests that accuracy is specific to the content domain being judged (Karst, 2012).
First, in schools, learning is in large part content-specific (Baumert, Lüdtke, Trautwein, & Brunner, 2009; Seidel & Shavelson, 2007). The teaching and learning contents are typically structured around subjects, which consist of highly associated content domains that are nevertheless psychometrically distinguishable, suggesting, therefore, the use of domain-specific assessments (Brunner, 2006; Harks, Klieme, Hartig, & Leiss, 2014; Lonigan & Milburn, 2017). Accordingly, teachers need to judge students' achievement not only in domains specified at the subject level but also, and more importantly, in content domains within subjects. This in turn enables teachers to gain deeper insight into students' understanding and provide appropriate learning opportunities (Artelt & Rausch, 2014; Brunner, Anders, Hachfeld, & Krauss, 2013; Seidel & Shavelson, 2007; Shulman, 1987). Furthermore, results from interview studies on teachers' decision-making in lesson planning show that teachers focus their lesson planning on the specific content to be taught. In doing this, they take into account their judgments of students' respective abilities (Morine-Dershimer, 1978–1979; Randi & Corno, 2005; Shavelson & Stern, 1981).
Second, for content-specific judgments to be accurate, content knowledge (CK) and pedagogical content knowledge (PCK) are regarded as a basic prerequisite (Herppich et al., 2018; Shulman, 1987; Thiede et al., 2015). However, teachers can at the same time show strengths and weaknesses with regard to the content domains within a subject. For example, it has been shown that teachers may have sound knowledge of geometry while simultaneously having relative weaknesses in algebra (Blömeke, Kaiser, Döhrmann, & Lehmann, 2010). Accordingly, the judgment accuracy of individual teachers may vary between different content domains in which students' achievement is being judged (Herppich et al., 2018). The specific nature of such knowledge and its impact on the content-specificity of teacher judgments are also supported by a recent study by Hoppe, Renkl, and Rieß (2020). The study aimed at fostering pre-service biology teachers' ability to make "on-the-fly" judgments of students' conceptions by conveying topic-specific pedagogical content knowledge. It was found that the acquisition of pedagogical content knowledge related to a specific topic (e.g., importance of plants in ecosystems) was only effective for judgments on that topic and did not lead to better judgments on other topics (e.g., decomposition).
¹ Academic domains are often defined broadly as subject matter or discipline (e.g., mathematics). They can also be defined more specifically with respect to distinct content areas within a subject, such as algebra in mathematics (Harks et al., 2014). In the present study, we will use the term content domain to refer to distinct content areas within a subject and subject domain to refer to the subject as a whole. Domain will be used to refer to both subjects and content areas.
D. Kolovou, A. Naumann, J. Hochweber et al. Teaching and Teacher Education 100 (2021) 103298
Third, research on teacher expertise also speaks to the possibility that the ability to accurately judge students' achievement may not be generalizable across various content domains. This ability is considered to be a characteristic of expert teachers (Bromme, 2014; Leinhardt & Smith, 1985; Weinert, Schrader, & Helmke, 1990), and one that is acquired and fostered during teacher education (Dünnebier, Gräsel, & Krolak-Schwerdt, 2009; Hoppe et al., 2020; Van Ophuysen, 2006) as well as professional development (Thiede et al., 2015, 2018). However, considering that expertise has consistently been found to be domain-dependent, (expert) teachers cannot be assumed to be experts in judging students' achievement to the same extent across all content domains (Palmer, Stough, Burdenski, & Gonzales, 2005; see also Berliner, 1994, 2004).
To date, only very few empirical studies have addressed the specificity of teachers' judgment accuracy concerning student achievement across different domains.
The study by Lorenz and Artelt (2009) primarily focused on cross-subject variation of primary school teachers' judgment accuracy. In their study, accuracy in each of two language content domains (vocabulary range and reading comprehension) and in arithmetic were weakly correlated (r = 0.07 to r = 0.18). Across two measurement points, teachers' accuracy within each of the two language domains was substantially correlated (ranging from r = 0.42 to r = 0.44). Overall, however, the correlations between and within the different content domains were at best moderate, suggesting that teachers' judgment accuracy may be specific to individual content domains. Using confirmatory factor analyses, Lintorf et al. (2011) investigated the dimensionality of teachers' accuracy in assessing the difficulty of two reading tasks with six items each. The results provided no evidence to support one-dimensionality within a task. Instead, teachers' judgment accuracy proved to be dependent on the item difficulty of each task. For example, teachers who were able to make an accurate judgment on difficult items were less accurate on easy items. Praetorius, Karst, Dickhäuser, and Lipowsky (2011) investigated the domain-specificity of primary teachers' judgment accuracy of students' academic self-concept. In their study, they examined the correlations of judgment accuracy across different domains ("reading comprehension", "writing competence", and "mathematics") based on three different accuracy measures. Across the different accuracy measures, significant correlations were found only between the content domains "reading comprehension" and "writing competence" (ranging from r = 0.38 to r = 0.80), when the same judgment accuracy measures were used. Accordingly, a substantial degree of overlap between the language domains was evident.
The aforementioned studies focused on the subject-specificity of teachers' judgment accuracy, while mainly investigating primary school teachers. The only study that took into account the specificity of judgment accuracy concerning student achievement across different content domains focused on consistency across two language content domains (Lorenz & Artelt, 2009). Initial findings from these studies suggest that teachers' judgment accuracy is specific to different subjects (mathematics and language classes) or reading tasks, but less so to specific language content domains (e.g., vocabulary range and reading comprehension). However, the current state of knowledge about the degree to which teachers' judgment accuracy is specific to different content domains is rather limited. This is also due to methodological limitations of previous studies with respect to the measurement of teachers' judgment accuracy.

Challenges in measuring teachers' judgment accuracy
In previous investigations of the specificity of teachers' judgment accuracy across domains, judgment accuracy has been operationalised as the correlation between teachers' judgments and test performance for each single classroom or teacher (see Lorenz & Artelt, 2009). While this measure is common in the research on teachers' judgment accuracy, it has some significant limitations (Südkamp et al., 2012). Both teacher judgments and student achievement scores are subject to sampling and measurement error that may attenuate the correlation between teacher judgments and students' achievement (Kaiser, Südkamp, & Möller, 2017). In research on teachers' judgment accuracy, single-item measures of teacher judgments are common (see Südkamp et al., 2012), while achievement measures are based on a rather limited number of test items per student, leading to unreliable point estimates for both measures. Small sample sizes (n < 30), as are typical regarding the number of students judged per teacher, lead to imprecise estimates of accuracy at the classroom/teacher level (Schönbrodt & Perugini, 2013; see also Praetorius, Koch, Scheunpflug, Zeinz, & Dresel, 2017). Furthermore, teachers differ in the number of students for whom judgments are made. When the computed correlations are averaged across teachers or classrooms, these differences are not weighted accordingly. Hence, the calculated mean value does not reflect the mean judgment accuracy among teachers (Dollinger, 2013). Finally, although hierarchical data structures (i.e., judgments of students nested in classrooms or teachers) are common in teachers' judgment accuracy research, they are not directly taken into account in the previous measurement of judgment accuracy. Students in the same classroom tend to be more similar in terms of their level of achievement, and disregarding such dependencies may result in standard error estimates that are too small and significance tests that are too liberal (Hox, Moerbeek, & van de Schoot, 2018; see also Dollinger, 2013).
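The attenuating effect of measurement error noted above can be made concrete with Spearman's classical attenuation formula, under which the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The following is a minimal sketch under classical test theory assumptions; the function names and the numerical values are hypothetical and not taken from the study.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation under classical test theory."""
    return true_r * math.sqrt(rel_x * rel_y)

def disattenuated_r(observed_r, rel_x, rel_y):
    """Spearman's correction: estimate of the true correlation."""
    return observed_r / math.sqrt(rel_x * rel_y)

# Hypothetical values: true accuracy 0.70, judgment reliability 0.80,
# test-score reliability 0.70 (all invented for illustration).
observed = attenuated_r(0.70, 0.80, 0.70)
print(round(observed, 3))
print(round(disattenuated_r(observed, 0.80, 0.70), 3))
```

With reliabilities below 1, the observed correlation is always smaller in magnitude than the true one, which is why manifest accuracy correlations tend to understate teachers' accuracy.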
In order to address these methodological challenges, multilevel modelling techniques (for an overview, see Snijders & Bosker, 2012) are increasingly being used in research on teachers' judgment accuracy (e.g., Dollinger, 2013; Karst & Bonefeld, 2020; Kilday, Kinzie, Mashburn, & Whittaker, 2012; Meissel, Meyer, Yao, & Rubie-Davies, 2017; Ready & Wright, 2011). However, previous studies have been limited to modelling accuracy for a single domain at a time (i.e., a single outcome variable). Using multilevel regression, the accuracy measure in one domain is based on the (random) slope of test performance (i.e., test scores) when predicting teachers' judgments (Dollinger, 2013; Karst & Bonefeld, 2020; Karst, Hartig, Kaiser, & Lipowsky, 2017; Meissel et al., 2017). Still, measurement error in test performance and teacher judgments, and sampling error in test performance (the predictor variable), are commonly neglected. The multilevel modelling approach with latent variables used in this study enables model-based estimations of teachers' judgment accuracy in multiple domains simultaneously, appropriate handling of hierarchical and imbalanced data structures, and the specification of latent variables to deal with measurement error comparable, for instance, to "doubly latent" analyses of contextual effects (Lüdtke, Marsh, Robitzsch, & Trautwein, 2011; Marsh et al., 2012).
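For orientation, the single-domain multilevel regression described above is commonly written as a random-intercept, random-slope model. The notation below is a generic textbook sketch rather than the exact specification of any cited study; the symbols $X_{ij}$ (test score of student $i$ in classroom $j$), $\gamma$, $\sigma^2$ and $\tau$ are assumptions introduced here for illustration.

```latex
\begin{align}
Y_{ij} &= \beta_{0j} + \beta_{1j} X_{ij} + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim N(0, \sigma^2), \\
\beta_{0j} &= \gamma_{00} + u_{0j}, \\
\beta_{1j} &= \gamma_{10} + u_{1j}, \\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix} \right).
\end{align}
```

Here $Y_{ij}$ is the teacher's judgment of student $i$ in classroom $j$, and the classroom-specific slope $\beta_{1j}$ serves as the accuracy measure for teacher $j$.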

The present study
The present study seeks to systematically examine the content-specificity of secondary school teachers' judgment accuracy within each of two subjects, mathematics and German language class. More precisely, we examine the content-specificity of teachers' judgment accuracy by investigating the relations of judgment accuracy across three corresponding content domains within each subject. In each content domain within the two subjects, we focus on so-called global judgments of students' achievement. This type of judgment concerns ratings of students' overall performance in a domain and is typically examined using Likert-type rating scales (Karing, Matthäi, & Artelt, 2011; see also Südkamp et al., 2012). In particular, we asked teachers to make a global rating in each content domain for each of their students. The choice of two subjects makes it possible to examine the extent to which the results can be generalised across subjects. The content domains examined are "number and variable", "shape and space", and "measures, functions, data, and probabilities" for mathematics, and "reading comprehension", "listening comprehension", and "language(s) in focus" for German language class.
To investigate the relations of judgment accuracy across multiple domains, we used an innovative multivariate multilevel latent modelling approach that deals with typical methodological limitations of previous studies. In previous studies on this topic, teachers' judgment accuracy was typically operationalised as the correlation between teacher judgments and students' scores on standardised tests, calculated separately for each classroom or teacher. The domain-specificity was investigated by examining the manifest intercorrelations of these accuracy measures across various content domains (Lorenz & Artelt, 2009; Praetorius et al., 2011). However, the operationalisation of judgment accuracy as the correlation between teacher judgments and test performance is commonly criticised for being unreliable (see Challenges in Measuring Teachers' Judgment Accuracy section). Accordingly, the previously used measures of accuracy could lead to an underestimation of the true relationships of teachers' judgment accuracy across multiple domains. In our study, therefore, we implemented a multivariate multilevel latent modelling approach which accounts for multivariate correlated outcomes, enabling us to appropriately model teachers' judgment accuracy within each content domain and, simultaneously, to examine the latent relations across them. By doing so, we explicitly took into account the hierarchical data structure of students nested within classrooms/teachers and considered both sampling error (due to a limited number of students in each classroom) and measurement error in test scores (due to a limited number of test items per student). Our hypotheses (H1–H4) were as follows:
H1. Mathematics teachers' judgments are positively associated with their students' test performance in each mathematical content domain.
H2. Language teachers' judgments are positively associated with their students' test performance in each language content domain.
Based on previous research on the accuracy of teacher judgments (Südkamp et al., 2012), we expected positive and moderate to strong average associations between teacher judgments and test performance in each content domain.
H3. Mathematics teachers' judgment accuracy measures in different mathematical content domains correlate positively.
H4. Language teachers' judgment accuracy measures in different language content domains correlate positively.
In line with our theoretical considerations and previous findings on the content-specificity of teachers' judgment accuracy (Lorenz & Artelt, 2009; Praetorius et al., 2011), we expected that the content-specific accuracy measures within each subject correlate at least moderately positively, but remain clearly distinguishable (i.e., correlations are not close to perfect).

Design and sample
The research project on which the present study is based was conducted with a sample of 18 public lower secondary level schools (Sekundarstufe I; part of the compulsory education) that comprised grade levels 7 to 9 from the German-speaking Swiss Canton of Zurich (see Helbling, Tomasik, & Moser, 2019 for a more detailed description of the educational context of Switzerland and the Canton of Zurich). Participation of the schools was voluntary, but within the participating schools, teachers and students were (with few exceptions) obliged to take part in the project. The project encompassed four measurement points. All students who were admitted to the seventh grade of each school in the 2016/2017 school year were selected to participate. The content-specific measures of teacher judgments and respective measures of seventh graders' academic achievement that were used for the present study were collected at the first measurement point. Students completed computerised curriculum-based tests in the two subjects of mathematics and German language right at the beginning of Grade 7 (September/November 2016). Online questionnaires were used to collect all remaining data, including students' and teachers' demographic data as well as teacher judgments regarding their students' performance. Teachers made their judgments in December 2016/January 2017 over a period of four weeks. At this time point, teachers and students had known each other for approximately four months.
In the present study, we limited our analyses to mathematics and language teachers and their students for whom the following data were available: (a) teachers' judgments of their students' performance in all three mathematical or language content domains and (b) standardised tests completed by the students. Both types of data were necessary for calculating the measures of teachers' judgment accuracy. When a teacher's judgments and/or students' performance on standardised tests were not available, that teacher and her/his students were not included in the analyses. In our study, this was the case for teachers of one school that joined the research program from which the present data were drawn only after the students' performance had been measured for the first time. In addition, there were also some teachers for whom judgments or standardised test data of all their students were not available due to lack of participation. The resulting data set for our analyses comprised 54 mathematics teachers (out of n = 63) and 55 language teachers (out of n = 61) from 17 schools.
Of the nine mathematics teachers who were not included in the analyses, three did not judge their students in any content domain, four judged them in fewer than three domains (two of them due to teaching the same students but in different content domains), and for the remaining two teachers standardised test data were not available. Of the six language teachers who were not included in the analyses, four did not judge their students in any content domain, one judged them in fewer than three domains, and for one teacher standardised test data were not available.
Of the 54 mathematics teachers included in the analyses, 41% were women, and 30% had 1–5 years of teaching experience, 38% 6–10 years, 15% 16–25 years, and 17% up to 25 years (1.9% missing data). Of the 55 language teachers, 58% were women, and 28% had 1–5 years of teaching experience, 34% 6–15 years, 13% 16–25 years, and 25% up to 25 years (3.6% missing data). Twelve teachers, who were teaching both mathematics and German language in their classrooms, provided judgments for both subjects.² Furthermore, as a result of the aforementioned inclusion criteria, out of the project's overall student sample of 1462 students (49% female) at an average age of 13 years (SD = 0.51), data of 1170 students from mathematics classrooms and 1255 students from German language classrooms were analysed. For these students, teacher judgments in all three content domains in mathematics and/or German language class were available. With regard to the standardised test results, missing data on the individual domains varied between 1.88% and 3.07% in mathematics and between 2.39% and 3.67% in German language class.

Variables
Achievement in mathematics and German language. Tests were administered in three content domains in mathematics and German language, respectively, according to the common curriculum for German-speaking Switzerland: (a) "number and variable" (i.e., arithmetic and algebra); (b) "shape and space" (i.e., geometry); (c) "measures, functions, data, and probabilities"; (d) "reading comprehension"; (e) "listening comprehension"; and (f) "language(s) in focus" (assessing knowledge in language awareness, lexis, pronunciation, grammar, orthography and language learning reflection).
In the Canton of Zurich, lower secondary school consists of two or three levels (A, B, and possibly C, depending on the respective school), with A being the most demanding level. Additionally, students are taught in mathematics and German language in separate performance-based classrooms (I, II or III), with I being the most challenging. To enable the administration of tests corresponding to students' performance in different performance-based classrooms, a multi-matrix design was used for each test with three different test booklets of varying average item difficulty: easy, medium, and hard. Common items (anchors) in all test booklets were placed within the same relative position in order to ensure comparability of test performance at both the individual student level and the group level (i.e., classroom/teacher level). The tests for the content domains consisted of 20–25 dichotomously scored items, which showed satisfactory fit to the Rasch model (Rasch, 1960). The reliability for each dimension was generally satisfactory (WLE reliabilities: "shape and space": 0.73; "number and variable": 0.76; "measures, functions, data and probabilities": 0.74; "listening comprehension": 0.68; "reading comprehension": 0.66; "language(s) in focus": 0.73). Moreover, due to the inclusion of test performance as latent variables in our models, unreliability in the point estimates was less of a concern.
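As a brief illustration of the Rasch model mentioned above, the probability of a correct response to a dichotomous item depends only on the difference between a student's ability and the item's difficulty on a common logit scale. The sketch below uses a hypothetical function name and invented ability and difficulty values; it does not reproduce the scaling carried out in the study.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical student ability and item difficulties on the logit scale.
theta = 0.5
for label, b in [("easy", -1.0), ("medium", 0.0), ("hard", 1.0)]:
    print(label, round(rasch_probability(theta, b), 2))
```

Because anchor items share the same difficulty parameters across booklets, abilities estimated from easy, medium, and hard booklets can in principle be placed on the same scale.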
Teacher judgments in mathematics and German language. Following the commonly used approach in research on teachers' judgment accuracy, mathematics teachers and German language teachers were asked to predict the test performance of each student (Südkamp et al., 2012; see also Hoge & Coladarci, 1989). These judgments had to be provided separately for each of the six content domains in mathematics and German language. Prior to rating students' performance in each content domain, teachers were provided with ten (in mathematics) or seven (in German) preselected test items that were included in all test booklets from the corresponding content domain. This allowed the teachers to become familiar with the specific test content. Judgments were collected via 10-point Likert scales to allow for higher sensitivity in capturing teacher judgments (see Zhu & Urhahne, 2020). The lowest and highest response category of the rating scale were labelled as follows: "0–10%, the 10% lowest-performing students" and "90–100%, the 10% highest-performing students", respectively. Teachers were encouraged to give their individual appraisal of each student in comparison to all other students of the same grade level (i.e., seventh grade) in the Canton of Zurich. This is in line with current suggestions for comparisons beyond the classroom context, as teacher ratings are likely to be influenced by how well each student performs in relation to the average performance of their classroom (Baudson, Fischbach, & Preckel, 2016; see also Lazarides, Viljaranta, Aunola, & Nurmi, 2018; Wright & Wiese, 1988). For example, the following instruction was used for the content domain "number and variable": "For each student, please tick the box indicating how in your estimation he or she has performed on the test for the content domain NUMBER AND VARIABLE in comparison with all other students (roughly at the beginning of seventh grade) in the Canton of Zurich".
Control variables at classroom/teacher-level. The project from which this study draws data used a quasi-experimental design with assignment of schools to the treatment and control conditions. Teachers from schools in the treatment condition attended training programs in mathematics and German language didactics. The programs were designed to sensitise teachers to the learning difficulties of low-achieving students and to guide teachers in providing adequate support for these students. Thus, the programs were primarily expected to have an indirect positive influence (via enhanced teaching quality) on these students' mathematics and German language achievement. However, due to this focus on student achievement, it is possible that the participating teachers were also paying particular attention when judging the performance of their low-achieving students. Since our interest was not in these potential effects of the training programs, we decided to control for potential treatment effects in our analyses. We did this in line with previous studies on teacher judgment accuracy, which drew on data from research projects with comparable designs (e.g., Furnari, Whittaker, Kinzie, & DeCoster, 2017). Accordingly, we controlled for teaching in a treatment school using a dummy-coded grouping variable, "treatment" (0 = control group, 1 = treatment group). As some teachers taught in more than one participating classroom, we controlled for the "assignment of teachers" to multiple classes with another dummy variable (0 = one class, 1 = several classes). Nineteen mathematics teachers (out of n = 54) and 17 language teachers (out of n = 55) taught multiple classes in mathematics and German language class, respectively.

Analyses
Multilevel Modelling. As our study addresses multiple content domains simultaneously, we enhanced previous approaches to account for multivariate outcomes and regression parameter correlations on a latent level. For each subject, we specified one multivariate regression model with students i nested in classrooms/teachers j and teacher judgments Y dij as outcomes (see Fig. 1). Here, let Y dij be the judgment in the dth mathematical or language content domain for the ith student of the jth classroom/ teacher, with d ¼ {1, 2, 3} within each subject (mathematics: 1: "shape and space", 2: "measures, functions, data, and probabilities", 3: "number and variable"; German language: 1: "listening comprehension", 2: "reading comprehension", 3: "language(s) in 2 These teachers were comparable to all other teachers in terms of teaching experience, age and accuracy in the mathematical domains. Minor differences in favour of the majority of German teachers were found in the language domains. focus"). The outcome variables, that is, the teacher judgments Y dij , were z-standardised before entering the model for each subject. In each multivariate model, we added the within-teacher-component of student test performance within the corresponding content domain, q dij , as latent predictor on the student-level, while we controlled for "treatment" and "class assignment" at the classroom/ teacher-level. The resulting multilevel model for the multivariate outcomes in each subject is: for each of the k ¼ {0, 1} random student-level regression coefficients b kdj per classroom/teacher j and content domain d. That is, the regression slopes of student test performance q dij in each content domain d were allowed to vary across classrooms/teachers j, resulting in a random intercepts and random slopes regression. 
Accordingly, the random-effects covariance matrix Τ is a 6 × 6 matrix comprising a) three random intercept variances, b) three random slope variances, and c) the covariances of intercepts and slopes across and within the three content domains per subject.
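A form of the model that is consistent with the verbal description above can be sketched as follows. The notation is an assumption reconstructed from the surrounding text: the symbols γ (fixed effects), u (classroom/teacher-level residuals), ε and σ² (student-level residuals and their variances), and the names Treat_j and Multi_j for the two dummy variables are illustrative and not taken from the original. Possible residual covariances between domains on the student-level are omitted for brevity.

```latex
\begin{align}
Y_{dij} &= \beta_{0dj} + \beta_{1dj}\,\theta_{dij} + \varepsilon_{dij},
  \qquad \varepsilon_{dij} \sim N(0, \sigma^2_{d}), \\
\beta_{kdj} &= \gamma_{k0d} + \gamma_{k1d}\,\mathrm{Treat}_j
  + \gamma_{k2d}\,\mathrm{Multi}_j + u_{kdj},
  \qquad k \in \{0, 1\},\; d \in \{1, 2, 3\}, \\
\mathbf{u}_j &= (u_{01j}, u_{02j}, u_{03j}, u_{11j}, u_{12j}, u_{13j})^{\top}
  \sim \mathrm{MVN}(\mathbf{0}, \mathbf{T}).
\end{align}
```

Under this sketch, the slope β_1dj is the unstandardised accuracy of classroom/teacher j in domain d, and the lower-right 3 × 3 block of Τ holds the random slope variances and covariances.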
The latent predictor variable θ_dij, that is, a student's within-classroom/teacher ability component, was estimated from a three-dimensional multilevel 1PL IRT (ML-MIRT) model with between-item multidimensionality (i.e., each item measures only one dimension; see Reckase, 2009). This student-level ability component is by definition group-mean centered. The ML-MIRT model for mathematics comprised the three content-specific dimensions "shape and space", "measures, functions, data, and probabilities", and "number and variable". Similarly, the model for German language class comprised the content-specific dimensions "listening comprehension", "reading comprehension", and "language(s) in focus". When examining teachers' judgment accuracy (H1 and H2), we were interested in the pure student-level effect of test performance, θ_dij, on teacher judgments. Thus, we standardised the slope coefficients using standardisation on the student-level as part of the model fitting procedure (within-group standardisation; see Schuurman, Ferrer, de Boer-Sonnenschein, & Hamaker, 2016). In a first step, we estimated the student-level variances of the predictors (test performance in each content domain) and outcome variables (teacher judgment in each content domain) within each Markov chain Monte Carlo (MCMC) iteration (for details on estimation, see below). The classroom/teacher-specific standardised regression coefficients were then calculated as the product of the unstandardised coefficients and the ratio of the classroom/teacher-specific standard deviations of the predictor variable and the outcome variable. Subsequently, the average standardised regression coefficient (i.e., the average accuracy across classrooms/teachers) was estimated by calculating the average of the classroom/teacher-specific standardised coefficients in each MCMC iteration.
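The within-group standardisation step described above can be illustrated for a single MCMC iteration. This is a minimal sketch, not the code used in the study (which was run in JAGS); the function names and the example values are hypothetical.

```python
from statistics import mean


def standardise_slopes(slopes, sd_pred, sd_out):
    """Within-group standardisation for one MCMC iteration.

    Each argument holds one value per classroom/teacher:
    slopes  - unstandardised slopes of judgments on test performance
    sd_pred - student-level SD of the predictor (test performance)
    sd_out  - student-level SD of the outcome (teacher judgment)
    """
    if not (len(slopes) == len(sd_pred) == len(sd_out)):
        raise ValueError("inputs need one entry per classroom/teacher")
    # standardised slope = slope * SD(predictor) / SD(outcome)
    return [b * sx / sy for b, sx, sy in zip(slopes, sd_pred, sd_out)]


def average_accuracy(slopes, sd_pred, sd_out):
    """Average standardised slope across classrooms/teachers."""
    return mean(standardise_slopes(slopes, sd_pred, sd_out))


# Hypothetical values for three classrooms/teachers in one iteration.
example = standardise_slopes([0.5, 0.2, 0.4], [1.0, 2.0, 0.5], [2.0, 1.0, 1.0])
example_average = average_accuracy([0.5, 0.2, 0.4], [1.0, 2.0, 0.5], [2.0, 1.0, 1.0])
```

Repeating this computation in every iteration yields a posterior distribution for each classroom/teacher-specific coefficient and for the average accuracy.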
Substantively, the estimated classroom/teacher-specific standardised regression coefficients reflect the number of student-level standard deviations by which teacher judgments increase when student achievement increases by one classroom/teacher-specific standard deviation (see Schuurman et al., 2016). Due to the standardisation on the student-level, the standardised coefficients can be interpreted in a similar way to the correlation coefficients in other studies on teachers' judgment accuracy (Kilday et al., 2012), while taking sampling and measurement error in the classroom/teacher-specific accuracy estimates into account.
To analyse the content-specificity of teachers' judgment accuracy (H3 and H4), we used the random-slope covariance parameters to compute latent correlations between random slopes across the mathematical and language content domains, respectively. The correlation coefficients provided information about the extent to which teachers' judgment accuracy is consistent across content domains. Lower correlations indicate a lower consistency, which in turn reflects a higher degree of content-specificity. Because we controlled for the effects of the two dummy variables, we evaluated the residual correlations rather than unconditional correlations (see also right part of Fig. 1). To gain further insight into the effects of the control variables, we also estimated the regression models without controlling for the effects of the dummy variables and compared the results with respect to the latent correlations.
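The step from random-slope covariances to latent correlations can be sketched as below. The ordering of Τ (intercepts in positions 0 to 2, slopes in positions 3 to 5) is an assumption made for illustration, as are the function names; the source does not specify how the matrix was arranged.

```python
from math import sqrt

# Assumed layout of the 6 x 6 covariance matrix: intercepts first, then slopes.
SLOPE_INDEX = (3, 4, 5)


def slope_correlations(tau):
    """Correlations between the three random slopes for one MCMC draw.

    tau is a 6 x 6 covariance matrix given as nested lists. The result maps
    a pair of content domains, e.g. (1, 2), to the slope correlation.
    """
    corr = {}
    for a in range(len(SLOPE_INDEX)):
        for b in range(a + 1, len(SLOPE_INDEX)):
            i, j = SLOPE_INDEX[a], SLOPE_INDEX[b]
            # correlation = covariance / (SD_i * SD_j)
            corr[(a + 1, b + 1)] = tau[i][j] / sqrt(tau[i][i] * tau[j][j])
    return corr


def posterior_mean_correlations(tau_draws):
    """Average each slope correlation over all MCMC draws."""
    per_draw = [slope_correlations(t) for t in tau_draws]
    return {
        pair: sum(d[pair] for d in per_draw) / len(per_draw)
        for pair in per_draw[0]
    }
```

Computing the correlation within each draw, rather than from the posterior mean of Τ, keeps the full posterior of each correlation available for interval estimates.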
To provide additional evidence for content-specificity, we compared the two subject-specific multivariate multilevel regression models to models in which the random slopes within each subject were set to be equal. That is, we checked whether teachers' judgment accuracy is multidimensional (i.e., content-specific) or unidimensional.

Fig. 1. Teacher judgments in each content domain are predicted by latent ability on student-level in each content domain (θ_1ij, θ_2ij, θ_3ij). Parameters θ_1j, θ_2j, and θ_3j denote student ability on the classroom/teacher-level, as measured by test items x1, x2, etc. Random intercepts (β_01j, β_02j, β_03j) and random slopes (β_11j, β_12j, β_13j) may correlate across outcomes, indicating the degree of content-specificity (right part of the illustration). We entered two teacher-level variables (dummy variables), "Treatment" and "Class assignment", as classroom/teacher-level control variables.

Estimation and inference. All analyses were carried out in the Bayesian framework (e.g., Fox, 2010) using MCMC estimation. As we had no reliable prior information available, we assumed vague prior distributions only. All models were estimated using four chains with 2500 samples each after a burn-in phase of 5000 samples and a thinning interval of ten (i.e., every 10th iteration was recorded).
We derived point estimates by computing the mean of the posterior distribution of each parameter. Additionally, we computed 95% Bayesian credibility intervals (BCI) for all parameters; a result is regarded as statistically significant when the BCI does not include zero. We tested the equality of parameters (e.g., between two standardised regression coefficients) by calculating the difference δ between each pair of parameters as well as its BCI. Following this approach, two parameters are regarded as not significantly different if the BCI of the difference contains zero.
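The posterior summaries and the difference test can be sketched as follows. This is an illustration under two assumptions that the text does not confirm: the BCI is taken as an equal-tailed percentile interval, and the draws of the two compared parameters are paired by MCMC iteration. All function names are hypothetical.

```python
def percentile(sorted_draws, p):
    """Linear-interpolation percentile of pre-sorted values, 0 <= p <= 1."""
    pos = p * (len(sorted_draws) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_draws) - 1)
    frac = pos - lower
    return sorted_draws[lower] * (1 - frac) + sorted_draws[upper] * frac


def summarise(draws, level=0.95):
    """Posterior mean and equal-tailed credibility interval of one parameter."""
    s = sorted(draws)
    tail = (1 - level) / 2
    lo, hi = percentile(s, tail), percentile(s, 1 - tail)
    return {
        "mean": sum(s) / len(s),
        "lower": lo,
        "upper": hi,
        # "significant" in the sense used here: interval excludes zero
        "excludes_zero": lo > 0 or hi < 0,
    }


def compare(draws_a, draws_b, level=0.95):
    """Summarise the difference between two parameters, draw by draw."""
    if len(draws_a) != len(draws_b):
        raise ValueError("draws must be paired by MCMC iteration")
    return summarise([a - b for a, b in zip(draws_a, draws_b)], level)
```

If `compare(...)["excludes_zero"]` is false, the two parameters are treated as not significantly different.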
We assessed model convergence by visual inspection of the MCMC chains and by calculating the Gelman-Rubin R-hat statistic (Gelman et al., 2013) for each parameter. Convergence analyses showed that the R-hat values of all model parameters were below 1.1, indicating that acceptable MCMC convergence was achieved. However, there was one mathematics classroom/teacher for which students' ability parameters did not converge well (i.e., R-hat > 1.1). The values of the model parameters with and without this classroom/teacher did not differ substantially, yet standard errors were larger when including this classroom/teacher. Hence, we report the results including this classroom/teacher, resulting in a more conservative way of hypothesis testing. All models were estimated using R 3.6.0 (R Core Team, 2019), JAGS 4.3.0 (Plummer, 2017), coda (Plummer, Best, Cowles, & Vines, 2006) and mcmcplots (Curtis, 2018).

Table 1 provides an overview of basic descriptive statistics for teachers' judgments. In addition to the means and standard deviations, percentile ranks are also given to provide more detailed information on the ranges and distributions of teacher judgments. All variables were, on average, close to the theoretical scale mean (5.5) but exhibited a large variation. Table 2 shows the student-level intercorrelations of teacher judgments and test performance between the content domains for each subject. Both teacher judgments and test performance (on student-level) were strongly interrelated between the content domains of mathematics and of German. The intraclass correlation (ICC), which is the proportion of variance at the classroom/teacher-level, indicated for both achievement measures (i.e., teachers' judgments and test performance) that a considerable proportion of the variability existed between classrooms/teachers (see Table 2). However, in all content domains, the ICC was higher for test performance than for the teacher judgments.
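The ICC mentioned above follows directly from its definition as the share of total variance located at the classroom/teacher-level. The sketch below assumes that the two variance components are already available; how they were estimated in the study is not stated here, and the function name is hypothetical.

```python
def intraclass_correlation(var_between, var_within):
    """Proportion of total variance located at the classroom/teacher-level."""
    if var_between < 0 or var_within < 0:
        raise ValueError("variance components must be non-negative")
    total = var_between + var_within
    if total == 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Hypothetical variance components: 1 part between, 3 parts within.
example_icc = intraclass_correlation(1.0, 3.0)
```

An ICC of 0.25 in this hypothetical case would mean that a quarter of the variability lies between classrooms/teachers.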

Teachers' judgment accuracy
To evaluate the degree of correspondence between teacher judgments and test performance (i.e., teachers' judgment accuracy; H1 and H2), we examined the classroom/teacher-specific standardised coefficients of test performance when predicting teacher judgments in each mathematical or language content domain (see Analyses section). We first investigated the average effect of test performance on the corresponding teacher judgments, in other words, the mean judgment accuracy among teachers. As hypothesized in H1 and H2, we found a positive and statistically significant relationship between teacher judgments and test performance in both the mathematical and language content domains, with the corresponding BCIs not including zero. Across the mathematical content domains, the classroom/teacher-specific standardised regression coefficients showed a positive relationship to teacher judgments such that, on average, an increase of one SD in test performance was associated with a 0.25 [0.21, 0.30] SD increase in teacher judgments for "measures, functions, data, and probabilities", a 0.28 [0.22, 0.33] SD increase for "shape and space", and a 0.30 [0.24, 0.35] SD increase for "number and variable". The mean of the standardised regression coefficients did not differ significantly among the three mathematical content domains.

Across the language content domains, the classroom/teacher-specific standardised regression coefficients showed a positive relationship to teacher judgments such that, on average, an increase of one SD in test performance was associated with a 0.30 [0.25, 0.34] SD increase in teacher judgments for "language(s) in focus", a 0.32 [0.27, 0.36] SD increase for "listening comprehension", and a 0.33 [0.28, 0.38] SD increase for "reading comprehension". The mean of the standardised regression coefficients did not differ significantly among the three language content domains. Thus, teachers' accuracy in judging students' test performance was similarly pronounced in all content domains.

Content-specificity of teachers' judgment accuracy
To test the hypotheses that teachers' judgment accuracy measures in different content domains are positively correlated (H3 and H4), the latent correlations between the random slopes across the three content domains in each of the subjects of mathematics and German language were examined. More specifically, we report on the residual correlations between the random slopes (see Analyses section). As shown in Table 3 (for mathematics) and Table 4 (for German language), all effects captured by the two dummy variables (treatment and class assignment) were found to be nonsignificant. The results of the residual correlations for both subjects are presented in Table 5. With respect to mathematics teachers' judgment accuracy, the latent correlations across the three content domains ranged from r = 0.59 to r = 0.68; for language teachers, they ranged from r = 0.57 to r = 0.63.

Note. The predicted outcome is the z-standardised teacher judgment in the respective domain. Student test performance represents latent ability (group-mean-centered) on student-level in each content domain. Variances of intercepts and slopes are adjusted for the effects of treatment and class assignment. Values in square brackets indicate the 95% Bayesian credible interval (BCI). Note that the variances were rounded to two decimals. SHSP = shape and space; MFDP = measures, functions, data, and probabilities; NV = number and variable.

For the sake of comparability, we briefly report below on the correlations between the slopes derived from the multivariate regression models uncontrolled for the effects of the dummy variables (not shown in Table 5; see Analyses section). The slopes within each subject correlated significantly, with correlation coefficients ranging from r = 0.74 [0.59, 0.87] to r = 0.82 [0.72, 0.92] for the mathematical content domains and from r = 0.69 [0.51, 0.84] to r = 0.78 [0.65, 0.89] for the language content domains.
Although the correlation coefficients were generally higher, they were not significantly different from the coefficients when adding the control variables (reported in Table 5).

Additional model comparisons indicated that the multidimensional models differentiating accuracy between content domains (DIC mathematics: 80,823; DIC German language: 110,940) showed better fit than the unidimensional models in each subject (DIC mathematics: 80,847; DIC German language: 111,046). The DIC differences (mathematics: δDIC = 23.78, SE = 21.39; German language: δDIC = 105.65, SE = 21.47) were greater than 10, suggesting that the unidimensional models can clearly be ruled out (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2013).
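The model comparison reduces to a difference of DIC values checked against the rule of thumb of 10 cited above. The sketch below uses the rounded DIC values reported in the text, so the differences deviate slightly from the reported 23.78 and 105.65, which presumably rest on unrounded values; the function names are hypothetical.

```python
DIC_THRESHOLD = 10  # rule of thumb referred to in the text


def dic_difference(dic_unidimensional, dic_multidimensional):
    """Positive values favour the multidimensional model (lower DIC is better)."""
    return dic_unidimensional - dic_multidimensional


def favours_multidimensional(dic_unidimensional, dic_multidimensional):
    """True when the DIC difference exceeds the rule-of-thumb threshold."""
    diff = dic_difference(dic_unidimensional, dic_multidimensional)
    return diff > DIC_THRESHOLD


# Rounded DIC values as reported for each subject.
delta_mathematics = dic_difference(80847, 80823)
delta_german = dic_difference(111046, 110940)
```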

Discussion
The present study examined the extent to which teachers' judgment accuracy with respect to students' achievement is content-specific within each of the subjects of mathematics and German language class. To that end, the relationships between judgment accuracy in three mathematical content domains ("number and variable", "shape and space", "measures, functions, data, and probabilities") and three language content domains ("reading comprehension", "listening comprehension", "language(s) in focus") were examined using a Bayesian multivariate multilevel latent modelling approach. In doing so, we explicitly took into account the hierarchical data structure as well as sampling and measurement error (Lüdtke et al., 2008, 2011).

Teachers' judgment accuracy
In line with our expectations, both mathematics and language teacher judgments were positively associated with test performance in each content domain. To classify the level of judgment accuracy, our results can be compared to studies that operationalise judgment accuracy as the correlation between teacher judgments and test performance calculated separately for each classroom/teacher. This is possible because the classroom/teacher-specific standardised coefficients used to measure teachers' judgment accuracy in this study can be interpreted as correlation coefficients (see Analyses section; see also Kilday et al., 2012), although, importantly, they clearly refer to the relationships within classrooms/teachers. Compared to the average correlation of 0.63 in the meta-analysis of Südkamp et al. (2012), our analyses indicate low to medium average judgment accuracy (standardised coefficients ranging from 0.25 to 0.33) of mathematics and language teachers in all content domains. In particular, the results lie in the lower part of the correlation range (Südkamp et al., 2012; see also Kaufmann, 2020), while no differences in average accuracy were found between the content domains of each subject.

Content-specificity of teachers' judgment accuracy
As expected, the accuracy measures within each subject correlate strongly and positively on a latent level across the content domains (r = 0.57 to r = 0.68). Hence, a noticeable amount of shared variance exists between accuracy in different content domains, indicating a relative similarity of judgment accuracy across them. Accordingly, mathematics or language teachers who make an accurate judgment in one content domain tend to form a judgment in other content domains that is not equivalent but comparable to a considerable extent in terms of its accuracy. However, in order to determine whether content domains of judgment accuracy can be empirically separated as distinguishable dimensions, it is necessary, apart from model comparisons (i.e., multidimensional models versus unidimensional models), to interpret the magnitude of the latent correlations appropriately. To accomplish this, we refer to results on the separability of content domains in large-scale educational assessments like PISA (Programme for International Student Assessment; Organisation for Economic Co-operation and Development [OECD], 2019) and TIMSS (Trends in International Mathematics and Science Study; Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2009). In particular, we refer to results from latent correlations between content-specific dimensions based on multidimensional item response theory (MIRT) models. In these studies, latent correlations between content-specific dimensions in mathematics ranged from 0.62 to 0.91 (Blum et al., 2004; Brunner, 2006; Harks et al., 2014; Klieme, 2000; Liu, Wilson, & Paek, 2008) and were interpreted as indicating empirical separability. Accordingly, the latent correlations of the accuracy measures found in our study (r = 0.59 to r = 0.68 for mathematics and r = 0.57 to r = 0.63 for German language class) provide evidence that content-specific facets of judgment accuracy can be empirically distinguished both for mathematics and German language class.
Based on our results, accuracy in different content domains of a subject cannot be understood to simply reflect a common "accuracy dimension", and accuracy measures collected from different content domains cannot be used interchangeably without any reservations. Previous results on content-specificity only exist for language class. In the study of Lorenz and Artelt (2009), similar results were found (manifest correlations of r = 0.42/0.44 between vocabulary range and reading comprehension). Yet, the comparatively high latent correlations in our study could be due to methodological differences in operationalising teachers' judgment accuracy, the separability of the content domains, and the different grade levels being examined (see The Present Study section). In addition, one must keep in mind that, in contrast to previous results based on manifest intercorrelations of accuracy measures, we examined the latent correlations between the accuracy measures across the content domains. It must also be noted that our study used global judgments with respect to broad curriculum-based content domains. While such content domains are psychometrically distinguishable in students' test data, they remain highly associated (the latent variable intercorrelations for students' test performance on the student-level in our study ranged between 0.87 ≤ r ≤ 0.92 and 0.87 ≤ r ≤ 0.93 in mathematical and language content domains, respectively), as they share skills and abilities which concurrently contribute to and confound the learning development of each other (see Harks et al., 2014; Leinhardt, Zaslavsky, & Stein, 1990; Lonigan & Milburn, 2017). Furthermore, in the curricula, content domains are often confounded due to wording in the description of the related abilities and skills. As a result, teachers may tend to think about their students' ability in multiple content domains, although they are about to estimate students' ability only in a single content domain (Llosa, 2007). This could explain the high intercorrelations of teacher judgments within each subject that we found, which in turn may explain the high correlations between the content-specific accuracy measures.
Reflecting on our results, it is possible that teachers' judgment accuracy regarding students' achievement in both mathematics and German language class is organised in a multidimensional structure (Gabriele et al., 2016; Karst, Dotzel, & Dickhäuser, 2018; Lintorf et al., 2011; Spinath, 2005), which differentiates into content-specific facets. That is, content-specific facets of judgment accuracy may be nested in broad subject-specific factors reflecting judgment accuracy in mathematics and German language class, respectively. This is also supported by studies which indicate that teachers' judgment accuracy can be better described as a subject-related construct comprising more differentiated content-specific facets (Lorenz & Artelt, 2009; Praetorius et al., 2011). Additional support comes from the study of Hoppe et al. (2020), who showed that teachers' ability to make judgments of students' conceptions is acquired in a content-specific (i.e., topic-specific) way.
Given the multi-faceted structure that our results suggest, it is probably best to opt for content-specific measures of judgment accuracy, although the use of such specific measures depends on the research question under investigation, for example, whether the focus of a study is on a content domain or the subject as a whole. Even in the latter case, one could argue for the use of content-specific measures, since such measures are likely to provide more information than a single measure at the subject level. However, whether accuracy measures of aggregated content-specific judgments represent teachers' accuracy at the subject level cannot be answered in the present study and requires future research. Finally, designing trainings with content-specific foci may be fruitful in fostering judgment accuracy in different content domains within a subject (see Hoppe et al., 2020; Thiede et al., 2018). In particular, focusing on improving the utilisation of cues that are more predictive of students' achievement within a domain when making judgments seems to be a promising approach (Oudman, van de Pol, Bakker, Moerbeek, & van Gog, 2018; Thiede et al., 2015, 2018). In such trainings, it might be of great importance to enhance teachers' content-specific knowledge so that the teachers are more likely to be able to use the appropriate cues (Artelt & Rausch, 2014; see Thiede et al., 2015, 2018). In this regard, Thiede et al. (2018), who examined the effects of different professional development programs on teachers' judgment accuracy, suggested that increasing pedagogical content knowledge may contribute to improved judgment accuracy and student achievement.

Limitations and directions for further research
The current study contributes to research on teachers' judgment accuracy as we extended previous studies on content-specificity by investigating whether secondary school teachers' judgment accuracy is specific to different mathematical or language content domains. Furthermore, we extended previously used multilevel modelling approaches to simultaneously model teachers' judgment accuracy in multiple content domains as well as the relationships among them.
Our study has, however, several limitations. The first set concerns the sample and the instruments. In the present study, we used a data set from a research project with a quasi-experimental design with an assignment of schools to treatment and control conditions. Although we controlled for potential treatment effects in all our models using a dummy-coded grouping variable at the level of the classrooms/teachers, in line with previous studies (see Furnari et al., 2017), an impact of the treatment (a teacher training program) on our results cannot be ruled out completely. However, it may be noted that we used data from the first of four measurement occasions in the project, that is, from an early stage of the training program, when any effects of the program can be expected to be relatively small.
With respect to the instruments used, teachers were asked to rate each student's test performance on a series of 10-point scales in comparison to other students of the same grade level and region. Standardised tests, on the other hand, measure student performance based on a series of tasks. Accordingly, it cannot be ruled out that teachers, taking into account their daily interaction with their students, judge students' overall competence rather than students' test performance, which could lead to over- or underestimation in judgments (Karing, 2009). Furthermore, the judgment task used in this study (rating of achievement in a domain) can be characterized as less specific than other tasks such as rankings (i.e., ranking of students in their class with respect to their achievement) or estimating the number of correctly solved items in a test (Südkamp et al., 2012). However, in the meta-analysis of Südkamp et al. (2012), no effects of the specificity of the judgment task on teachers' average judgment accuracy were found.
In addition, the teachers in this study made their judgments at least two months after the performance test, which was administered right at the beginning of seventh grade (see Design and Sample section). Accordingly, it cannot be ruled out that some students improved considerably (or, to the contrary, made less progress than would be expected) in the time span between the performance test and the teachers' judgments. Similarly, it is possible that teachers took into account the daily performance of students within this time interval when judging their students. This may have resulted in teachers rating their students higher or lower than they would have done if both measures had been collected simultaneously. Südkamp et al. (2012) considered the time gap between the collection of teacher judgments and measures of students' academic achievement in their meta-analysis. They first classified studies according to when performance tests were administered: (a) at the same time as the teacher judgments (within a 1-month period; 73.3%), (b) at least one month after the teacher judgments (8.3%) and (c) at least one month before the teacher judgments (18.3%). Then, they investigated the moderating effect of the time gap for the 61 effect sizes included in their analysis. None of the effects of the time interval were statistically significant, that is, temporal proximity was not associated with higher judgment accuracy. Nevertheless, future studies should carefully consider the potential impact of time gaps when planning their studies.
Furthermore, although we have followed the most common approach in judgment accuracy research for measuring teacher judgments (see Südkamp et al., 2012; see also Feinberg & Shapiro, 2009; Hoge & Coladarci, 1989), the proximity of this measure to assessment situations in daily teaching is limited (Kaiser, Praetorius, Südkamp, & Ufer, 2017). Accordingly, future research should be devoted to the development of measures with higher ecological validity and to the investigation of their content-specificity.
Moreover, we examined judgment accuracy in broadly defined mathematical and language content domains and their relations (see Harks et al., 2014). We investigated these relations separately for each subject. Cross-subject relations between the content domains could not be studied because only a very small subsample of teachers taught and judged (the same) students in both subjects. Future studies should, however, specifically aim to investigate cross-subject relations and plan the sampling of teachers accordingly. Besides, since teacher judgments depend on the nature of a domain, it is possible that our results cannot be generalised to other types of judgments. This could be the case for judgments that relate to domains defined at more fine-grained levels such as topic- and task-specific judgments (see Hoppe et al., 2020; Lintorf et al., 2011), or to other subjects. In this respect it is also not clear to what extent the results can be generalised to other grade levels and educational systems. For instance, the extent to which judgment accuracy is content-specific or content-general might depend on the structure and contents of teacher education in the respective content domain (see Blömeke, Kaiser, Döhrmann, & Lehmann, 2010). However, as can be deduced from research results on teacher knowledge, content-specificity in relation to teachers should be defined more broadly than in expert research (Blömeke, Busse, Kaiser, König, & Suhl, 2016). Finally, the question remains open to what extent content knowledge and pedagogical content knowledge influence the accuracy of teacher judgments in different content domains and consequently the relations across them (Herppich et al., 2018; Thiede et al., 2018).

Conclusions
We investigated the content-specificity of teachers' judgment accuracy, a rather neglected topic in research on judgment accuracy. To that end, we used a multivariate multilevel modelling approach with latent predictor variables, which also represents a methodological extension of the multilevel modelling techniques previously used in this research area. We provided empirical evidence for strongly associated, but psychometrically separable content-specific facets of judgment accuracy for both mathematics and language teachers. Therefore, depending on their focus, future studies should consider this aspect when deciding how to measure and promote teacher judgment accuracy. In order to gain differentiated insights into teachers' accuracy in assessing students' performance within a subject or in a particular content area, content-specific measures should be preferred. More generally, researchers should be cautious when generalising teachers' ability to accurately gauge students' performance across domains.

Author Note
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Special thanks go to Lena Hollenstein and Olivia Rütti-Joy for their feedback and support.