The Relationship Between Teachers' Cue-Utilization and Their Monitoring Accuracy of Students' Text Comprehension

We investigated to what extent teachers' use of diagnostic cues and the accuracy with which they interpreted or judged the values of those cues affected teachers' monitoring accuracy. Forty-six secondary education teachers judged the text comprehension of six students (216 students in total). Mere use of diagnostic cues did not appear to be sufficient. Rather, accurately judging the values of a diagnostic performance cue was related to higher monitoring accuracy. Using non-diagnostic student cues hampered teachers' monitoring accuracy. The key to further improving monitoring accuracy might lie in improving teachers’ ability to accurately judge diagnostic cues and helping them ignore non-diagnostic cues.
© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Every student is different and thus has different needs to learn effectively. Instructional support that is adapted to these needs promotes students' learning (Author, 2010; Parsons et al., 2018). To deliver adaptive support, teachers must know what their students know (Author, 2011; Klug et al., 2013). During or in between lessons, teachers determine what their students know by looking at students' work. Based on this, teachers adapt their instruction or lesson plan for subsequent lessons. Yet, a meta-analysis showed that teachers' monitoring accuracy of students' performance (i.e., the relation between teachers' judgments of students' performance and students' actual performance) is far from perfect and that there is much room for improvement (Südkamp et al., 2012). In the current study, we focus on this essential skill of monitoring students' performance, which is a necessary condition for delivering adaptive instruction (Author, 2019B).
According to the cue-utilization framework (Koriat, 1997), people use cues (i.e., “bits of information that might potentially be drawn upon or referred to by a teacher to inform a judgment”; Snow, as cited in Cooksey et al., 2007, p. 431) when making judgments. Teachers can, for example, deduce cues by inspecting students’ work (e.g., correctness of answers). Additionally, teachers can use information about students such as effort in class or interest in a text topic or information about the task (e.g., text difficulty or length). Using cues that are predictive or diagnostic of the judged outcome (here: text comprehension) promotes teachers' monitoring accuracy. For example, when teachers focus on students' ability to explain a text (i.e., a diagnostic cue), teachers' judgments of students' test scores are more accurate than when focusing on whether students find a text interesting (i.e., a non-diagnostic cue). Previous studies in which teachers were provided with information containing diagnostic cues, however, showed mixed results regarding teachers' monitoring accuracy. One study found no effect of access to diagnostic cues (Author, 2019) while other studies found a positive effect (Author, 2010, Author, 2010). Yet, teachers' monitoring accuracy was quite low in all studies.

* Corresponding author. E-mail address: j.e.vandepol@uu.nl (J. van de Pol).
Just making information available from which diagnostic cues can be deduced may be insufficient to boost monitoring accuracy. Teachers may not know what information to focus on or process information superficially (Glock et al., 2012) and therefore not actually use diagnostic cues. Moreover, even if teachers use diagnostic cues, they would have to accurately interpret or judge the actual values of those cues (i.e., 'used-cue value judgment accuracy') for their monitoring accuracy to improve. For instance, a student's ability to explain a text is a diagnostic cue, but if a teacher judges that a student can explain a text well, whereas this is actually not the case, their cue-judgment would be inaccurate.
The current study's aim is to investigate to what extent cue-utilization and used-cue value judgment accuracy are related to teachers' monitoring accuracy of their students' text comprehension. Although it may seem self-evident that used-cue value judgment accuracy is related to teachers' monitoring accuracy, nothing is known yet about this relation, and previous studies only focused on cue-utilization and its relation with teachers' monitoring accuracy of students' comprehension. Determining the role of used-cue value judgment accuracy is theoretically important as this aspect may need to be added to theoretical and/or process models of teacher monitoring. Additionally, it is practically important as it may shift the focus of interventions for improving teachers' monitoring accuracy from cue-utilization to used-cue value judgment accuracy. In the current study, teachers had access to students' products of generative activities they engaged in. Generative activities refer to activities that involve "actively making sense of to-be-learned information by mentally reorganizing and integrating it with one's prior knowledge" (Fiorella & Mayer, 2016, p. 717). Engaging in such activities generates diagnostic cues for students and teachers (Author, 2014). Such activities, for example making drawings or completing diagrams about a text, are common practice in education (cf. Fiorella & Mayer, 2016). In the current study, teachers viewed students' completed diagrams, as these concisely represent students' text comprehension.
When studies measured cue-utilization, they mostly did so by calculating the correlation between cue values and judgments (e.g., Author, 2014; Schleinschok et al., 2017). Schleinschok et al. (2017), for example, calculated correlations between characteristics of drawings that students made about texts (e.g., idea units) and judgments about students' text comprehension to express cue-utilization. However, teachers have no access to these cue values; they also judge these values. So relating actual cue values to teachers' judgments of students' text comprehension may not necessarily express their cue-utilization. Especially given that teachers often overestimate their students, this correlational measure may overstate their cue-utilization. Therefore, we used teachers' self-reported cue-utilization. For this purpose, we compiled a cue-list based on think-aloud data of previous studies (Author, 2018; Table 1) and complemented it with cues from the literature (Bennett et al., 1993; Cooksey et al., 2007; Dinsmore & Parkinson, 2013; Dusek & Joseph, 1983; Jenkins & Demaray, 2016; Mizala et al., 2015; Rausch et al., 2015; Weaver & Bryant, 1995).

Cue-diagnosticity
A cue is highly diagnostic when the relationship between actual cue values (e.g., commissions in students' work) and judged outcomes (e.g., students' text comprehension test score) is strong. Kostons and de Koning (2017), for example, showed that elements and details in students' drawings about texts were diagnostic of students' test performance. Moreover, drawings in their experimental condition, which aimed at and resulted in improved monitoring accuracy, contained more of these diagnostic cues than drawings in the control condition in which monitoring accuracy was lower. Measuring cue-diagnosticity can thus help explain differences in monitoring accuracy.
Generally, performance cues seem most diagnostic (e.g., Author, 2010B; 2019B; Griffin et al., 2009). Next to using diagnostic cues, non-diagnostic cues should be ignored. Using vignettes manipulating cue availability, Kaiser et al. (2015) showed that teachers' judgments of students' mathematics achievement were more accurate when they only had (diagnostic) performance cue values available (i.e., oral/written mathematics achievement) than when they additionally had student cue values available (e.g., students' gender, intelligence). When monitoring their own students' mathematics performance, teachers were also most accurate when having only diagnostic performance cues available (by providing teachers with anonymized student work) instead of only student cues or performance and student cues (Author, 2018).
Two studies that directly measured the performance cues' diagnosticity by relating actual cue values to students' test scores showed that some performance cues are highly diagnostic whereas other performance cues are not (Author, 2014). Specifically, correct causal relations in students' diagrams were highly diagnostic (r = 0.40–0.50); commissions and factual information in students' diagrams had low diagnosticity (r = -0.15 to -0.25 and r = -0.09, respectively). This suggests that teachers' judgments of students' performance would be more accurate when using causal relations but not when using commissions.

Teachers' cue-utilization and used-cue value judgment accuracy
Few studies measured teachers' cue-utilization. Think-aloud analysis showed that the higher teachers' use of diagnostic performance cues and the lower their use of non-diagnostic student and task cues, the higher their monitoring accuracy of students' performance was (Author, 2018). However, teachers' utilization of (non-)diagnostic cues could not explain differences in teachers' monitoring accuracy in another study (Author, 2020). This study found that teachers' monitoring accuracy of students' text comprehension was lower when having only performance cues available compared to having performance and student cues available. This finding was surprising as analyses of the think-aloud protocols showed that when only having performance cues available, teachers used up to 25% more (diagnostic) performance cues than when having performance and student cues available. Further analyses, however, suggested that, even though teachers used diagnostic cues, they had difficulties in accurately interpreting or judging cues that could be derived from the diagrams (e.g., correct relations). Thus, correct cue interpretation may also play a role (cf. Funder, 1999). To further explore this, we asked teachers to judge cue values of used cues. We compared these cue-value judgments to the actual cue values to compute the used-cue value judgment accuracy. To the best of our knowledge, this is the first study to relate teachers' used-cue value judgment accuracy to teachers' monitoring accuracy of students' performance.

Table 1 (excerpt; columns: cue | instrument | reliability | example item (number of items; answer scale))
Ongoing Engagement Subdomain scale (IRRE, 1998) | .76 | I pay attention in the lessons of teacher X (5; 1 (totally disagree) to 4 (totally agree))
Student precision when working on assignments/tests (tidy/systematic) | Big Five conscientiousness scale (Goldberg, 1992) | .86 | To what extent do you show the following traits in class of teacher X: precision (6; 1 (not true at all) to 7 (entirely true))
General knowledge and skills students | 1.00 | The topic of this text is fascinating to me. (4 per text; 1 (not at all true) to 5 (very true))
General personal characteristics
Extraversion: How talkative/active is this student generally in class? | Big Five extraversion scale (Goldberg, 1992) | .89 | To what extent do you show the following traits in class of teacher X: quietness (reverse coded) (6; 1 (not true at all) to 7 (entirely true))
Degree of self-efficacy (certainty/self-confidence) with regard to school work for teacher's subject | Perceived self-efficacy scale (Marsh et al., 2006) | .83 | I'm certain I can understand the most difficult material presented in the study materials of subject X (4; 1 (almost never) to 4 (almost always))
The student's gender | Student self-report | NA | NA
Learning problems student (dyslexia, adhd, add, autism, giftedness, dyscalculia, Dutch as second language) | Student self-report | NA | 0: student does not have the learning problem // 1: student has the learning problem
Nationality student: Based on birth country student/mother/father | Student self-report | NA | 5: student, mother and father born in the Netherlands (NL) // 4: student and mother or father born in NL // 3: student born in NL, mother and father not // 2: student not born in NL, mother and father born in NL // 1: student not born in NL, mother or father born in NL // 0: student, mother and father not born in NL
Mental capacity
Student's IQ | Raven standard progressive matrices (Bilker et al., 2012) | .54 | (9 items; per item 6 or 8 answer options)
(Landis & Koch, 1977).
b The diagnosticity and used-cue value judgment accuracy is based on students' grades for Math, Science, and English.

The current study
To better understand how cue-utilization relates to teachers' monitoring accuracy, we examined cues that teachers used to judge students' text comprehension and their judgments of the cue values. Teachers completed three conditions in a within-subjects design. First, teachers only judged students' performance (judgment-only condition). Then, teachers judged students' performance and selected used cues from a list (judgment + cue-list condition). Finally, they judged students' performance, selected cues, and rated the perceived cue values (judgment + cue-list + cue-value-judgment condition).
Forty-six secondary school teachers judged their students' text comprehension while having various information sources available from which they could deduce cues: students' completed diagrams about causal relations in each text (giving access to performance cues), students' names (access to student cues), and the texts and test (access to task cues). Teachers had these information sources available in all three conditions. A special feature of this study is that we measured the actual values of all included cues (e.g., students' IQ, correct relations in students' diagrams, and text characteristics). First, this is useful for future research, which may draw on these cue-diagnosticity values. Additionally, measuring diagnosticity enables us to: (1) take the actual cue-diagnosticity for this sample into account in interpreting our results, and (2) measure how accurately teachers judge cues by relating the actual cue values to teachers' cue judgments. We address the following research questions:
RQ1: To what extent are a wide range of performance, student, and task cues (cf. Table 1) diagnostic of students' text comprehension? Based on previous research (Author, 2014, 2019), we expect that correct relations in students' diagrams are highly diagnostic (r > 0.50) whereas commissions and factual information in the diagram have low diagnosticity (r < 0.30). The diagnosticity of other cues is explored. Moreover, we expect that performance cues are more diagnostic than student and task cues.
RQ2: What cue-use patterns occur when monitoring students' text comprehension? A cue-use pattern is the constellation of cues used for a judgment, consisting of one or several cues. Based on Author (2020), we expect that teachers use, on average, six cues per judgment and mostly use performance cues, followed by student cues and task cues. Cue-use patterns are explored.
RQ4: To what extent do cue-utilization and used-cue value judgment accuracy relate to teachers' monitoring accuracy of students' text comprehension? We expect that when teachers use highly diagnostic cues and judge these cues accurately, their judgments of students' text comprehension are most accurate.

Participants and design
Forty-six secondary education teachers of subjects for which text comprehension is important (e.g., languages, history/geography) participated (64% female; 94% Dutch). The sample size was based on a multilevel a priori power analysis (power = .80) conducted in spa-ml (Moerbeek & Teerenstra, 2015). Teachers had known their classes for 10.64 months on average (SD = 6.39)¹ and had, on average, 12.5 years of teaching experience (SD = 7.92). They received a €50 voucher for participation.
The study had a within-subjects design, with all teachers judging three students' text comprehension under three conditions in the following order: judgment-only; judgment + cue-list; judgment + cue-list + cue-value-judgment. For each condition, teachers judged three of their students' comprehension and made separate judgments for each text read by students. Overall, teachers made 405 judgments (135 students × 3 texts) in the judgment-only condition, 405 judgments (135 students × 3 texts) in the judgment + cue-list condition, and 408 judgments (136 students × 3 texts) in the judgment + cue-list + cue-value-judgment condition.² Although there were three conditions, only the judgment + cue-list and judgment + cue-list + cue-value-judgment conditions provided teachers' self-reported cue-utilization data, which was this study's focus. The judgment-only condition was implemented to check whether explicating cue-utilization and judging cue values was related to teachers' monitoring accuracy. There were no significant differences between conditions regarding students' test scores, teachers' judgments, and teachers' monitoring accuracy in terms of deviation or bias (all p's > 0.05; see Table 2 for M's and SD's). In the current study, we only used data of the judgment + cue-list and judgment + cue-list + cue-value-judgment conditions (261 students; M age = 15.15, SD = 1.37; 50.7% female; 93.8% born in the Netherlands).
Students whose text comprehension was judged were selected based on their general reading comprehension test scores (see student measures). For each condition, we selected a student with low (≈20th percentile), medium (≈50th percentile), and high (≈80th percentile) scores. Within each condition, the order in which these three students were judged was randomized. This study received approval from the ethics review board of the first author's institute.
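The selection of one low, one medium, and one high scorer per condition can be illustrated with a minimal sketch. The function name, the nearest-rank rule, and the example scores are assumptions made for illustration; the study does not report how the percentile match was computed.

```python
def pick_by_percentile(scores, targets=(20, 50, 80)):
    """Return one student id per target percentile (hypothetical nearest-rank rule)."""
    ranked = sorted(scores, key=scores.get)  # student ids ordered from low to high score
    n = len(ranked)
    chosen = []
    for p in targets:
        # Integer arithmetic: position of the p-th percentile, rounded to the nearest rank.
        idx = (p * (n - 1) + 50) // 100
        chosen.append(ranked[idx])
    return chosen


# Invented reading comprehension scores for one class.
scores = {"s01": 12, "s02": 7, "s03": 15, "s04": 9, "s05": 18, "s06": 11,
          "s07": 5, "s08": 14, "s09": 16, "s10": 10, "s11": 13}
low, medium, high = pick_by_percentile(scores)
print(low, medium, high)
```

In this sketch, the order in which the three selected students are then judged would still need to be randomized separately.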

Expository texts
Students read three texts, derived from the study by Author (2019B). The topics of the texts were "Music makes smart" (167 words), "Sinking of metro cars" (158 words), and "Concrete constructions" (166 words). Each text contained five clauses conveying causal relations (see Appendix for instructions).

Student diagrams
After reading, students completed diagrams. For these diagrams, students were asked to write down the text's cause-and-effect relations (see Appendix for instructions). Please see Fig. 1 for an example. Students did not receive feedback on the quality of their diagrams. Coding of the diagrams, information about the interrater reliability, and an example can be found in the section 'Performance Cue Values - Diagram Cues'.

¹ There was no effect of the number of months the teacher knew their class on their judgment accuracy of student characteristics in our data; test results can be requested from the first author.
² For some teachers, there were not enough students available (due to illness or because they declined participation); in the judgment-only condition, three teachers made judgments about two students, in the judgment + cue-list condition, one teacher made judgments about two students and one teacher about one student, and in the judgment + cue-list + cue-value-judgment condition, two teachers made judgments about two students.

Text comprehension test
For each text, students completed a test question. Students were asked to describe (in text format) the causal relations in each text. They were provided with one of the causes or effects for each question and with signaling words that they could use to make the order of the causes and effects clear (e.g., 'for that reason', 'first'). See Appendix for instructions and an example question.
For scoring students' answers, we used an existing answer format (cf. Author, 2014; 2019). The answer format was straightforward, as it consisted of the correct cause-and-effect elements and the order of these elements as represented in the texts. Students were assigned one point for each correct element that was detected in their answer (range per text: 0–4). They did not get points for copying the provided element. Data of 50 students was double coded by two assistants and the interrater reliability was substantial (Krippendorff's alpha: 0.93). Additionally, we determined the number of correct combinations of two elements (i.e., the number of correct relations per text: 0–4) (Krippendorff's alpha: 0.88). The total test score was the sum of correct elements and relations (range: 0–8). The reliability of the test was acceptable (α = 0.73). Furthermore, the test seemed to validly measure students' understanding of causal relations in the text. That is, just reproducing information from the text would not result in a high comprehension score; the students had to show actual understanding of the link between the causes and effects by describing them in the right order to obtain points. This is substantiated by high correlations between students' test and diagram scores indicating students' understanding of causal relations in the texts such as the correct relations (r = 0.96) and the correct elements (r = 0.91).

Table 1 summarizes the most important information about each instrument used to measure the actual cue values of the performance, student, and task cues. Additional information on some instruments is provided here. To assess the quality of instruments measuring knowledge and understanding (e.g., general reading comprehension, reproduction test, prior knowledge), we used three quality indicators: question difficulty, discrimination, and reliability (Van Berkel & Bax, 2006; Van den Brink & Mellenbergh, 1998).
If an instrument performed below par on ≥2 indicators, we excluded the variable.

Actual cue values
Performance Cue Values - Diagram Cues. We coded students' diagrams to measure diagram cues, using an existing answer format (cf. Author, 2014; 2019). First, the facts in the diagrams were coded. The answer format contained a list of facts; facts pertained to details in the text that were not essential for understanding the cause-and-effect relations. Each fact was assigned 0 (incorrect/not mentioned in the text) or 1 (correct). Three assistants coded 60 diagrams (Krippendorff's alpha: 0.99).
Second, we coded diagram elements (i.e., causes/effects). Elements were coded as correct when matching the answer format (0–4 per text) or as commission when an element in a student's
Kamalski, 2007). The cloze test consisted of an expository text derived from Author (2014) that was comparable in length and difficulty to the main texts of this study.
In the text (215 words), 20 words were omitted and students had to complete missing words. The test was piloted and items that were too easy were replaced. The item difficulty varied and the test was not too easy or too difficult; the percentage of students that answered an item correctly ranged from 9.7% to 92% with an average of 62% (SD = 22%; cf. Van Berkel & Bax, 2006; Van den Brink & Mellenbergh, 1998). The item-rest correlations were sufficient for 14 of the 20 items (M = 0.18; SD = 0.08), indicating that these items discriminated well between students with low and high test scores (cf. Van Berkel & Bax, 2006; Van den Brink & Mellenbergh, 1998).
Student Cue Values - Students' Ability to Reproduce Facts. We used a text and test items for measuring students' retention of facts (Author, 2014). Two assistants coded 90 students' answers (Krippendorff's alpha: 0.98). The item difficulty varied and the test was not too easy or too difficult; the percentage of students that answered an item correctly ranged from 18.8% to 76.8% with an average of 51% (SD = 27%). The item-rest correlations were sufficient for four of the five items (M = 0.17; SD = 0.02).
Student Cue Values - Students' IQ. Although the shortened Raven Progressive Matrices test showed high internal consistency in previous research (Bilker et al., 2012), Omega was moderate in our sample (0.54). Overall, the item difficulty varied (M proportion correct adjusted for chance = 0.63, SD = 0.23, range = 0.19–0.86) and items were not too difficult or too easy given that all items had p-values above chance. The item-rest correlations of all items were sufficient to very good (M = 0.27; SD = 0.07).
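The two item-level quality indicators reported above (item difficulty as the proportion of students answering correctly, and the item-rest correlation as a discrimination index) can be sketched as follows. This is an illustrative sketch only: the function names and the response matrix are invented, and the study's actual software and cut-off values are not reported here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """responses: one list of 0/1 item scores per student.

    Returns (difficulty, item-rest correlation) per item, where the rest score
    is the student's total score excluding the item itself.
    """
    stats = []
    for i in range(len(responses[0])):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]
        difficulty = sum(item) / len(item)  # proportion of correct answers
        stats.append((difficulty, pearson(item, rest)))
    return stats


# Invented data: five students, three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
for difficulty, item_rest in item_statistics(responses):
    print(round(difficulty, 2), round(item_rest, 2))
```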

Judgments of students' text comprehension
Per text, teachers indicated how many points they thought each student scored (0–8). The information they could use was: students' completed diagrams (performance cues), information they knew about their student (student cues), and information they remembered about the expository texts and test (task cues).

Cue-Utilization
Teachers were asked 'What did you base your judgment upon? Please be as complete as possible'. They received a cue-list (Table 1, columns 1–3) and the experimenter explained each cue. Additionally, an explanation of the meaning and measurement of each cue was available (not printed in Table 1) to ensure teachers interpreted the cues as intended.
We piloted the cue-list with two teachers, resulting in a cue-list of 28 items. That is, in addition to the cues in Table 1, four cues were originally present on the cue-list but were omitted from our analyses. Two cues appeared redundant (text difficulty and readability), one turned out to be impossible to score reliably (spelling/grammatical mistakes), and the instrument to determine the actual cue value of prior knowledge was insufficient regarding all three quality indicators.
The order of the cue types (i.e., student, performance, and task cues) on the cue-list was systematically varied using a Latin-square design, resulting in six versions. There were no significant differences in cue-utilization, judgment height, and monitoring accuracy between versions (all p's > 0.05). For each judgment, teachers indicated which cue(s) they used (0 = not used; 1 = used). Finally, to check whether using a cue-list affected teachers' cue-utilization, we compared teachers' cue-utilization to teachers' cue-utilization in a previous study using a think-aloud procedure without a cue-list (Author, 2020). In Author (2020), teachers mostly focused on the completeness (e.g., do diagrams contain all necessary elements) and correctness of the diagrams (elements and relations). As this is highly similar to our results, there does not seem to be a reason to assume that providing teachers with a cue-list affected their cue-utilization.
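Three cue types can be ordered in exactly six ways, which matches the six cue-list versions mentioned above. The sketch below enumerates these orders. The cycling rule used to assign a version to a teacher is a hypothetical assumption, as the study does not describe how versions were allocated.

```python
from itertools import permutations

CUE_TYPES = ("student", "performance", "task")

# All possible orders of the three cue types on the cue-list (3! = 6 versions).
versions = list(permutations(CUE_TYPES))


def version_for(teacher_index):
    """Hypothetical allocation rule: cycle through the six versions."""
    return versions[teacher_index % len(versions)]


for number, order in enumerate(versions, start=1):
    print(number, " > ".join(order))
```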

Cue judgments
In the judgment + cue-list + cue-value-judgment condition, teachers also made judgments for used cues. If a teacher, for example, indicated that they used student interest and IQ, they were asked to answer the interest scale for this student and to indicate how many questions of the Raven standard progressive matrices the student answered correctly. For all student cues for which we used self-report scales (e.g., student interest) or student tests (e.g., general reading comprehension level), teachers viewed the questions of the scales/test and, if relevant (e.g., on the Raven test), the correct answers. The minimum/maximum cue judgment values corresponded to the minimum/maximum of the instruments used to measure cue values. For cues for which the actual cue values were obvious, teachers did not estimate the cues (i.e., student's gender, omissions, text length and position, time to complete diagram, and mean number of words in the diagram boxes). For learning problems, teachers indicated whether students had (1) or did not have (0) a learning problem.

Students
Both sessions took place in a computer room at the participants' school during a lesson period, with the whole class present. Students completed the tasks individually at their own pace on a computer in two sessions (Fig. 2). Although the teacher was present, a researcher led the session and made sure students worked in silence on the tasks. In session two, students practiced reading and diagramming guided by a movie clip. During practice, they read two texts, completed two diagrams, and two test questions. Additionally, they compared their answers to an answer model. The movie clip contained explanation on how the task worked and provided and discussed the answer models.

Teachers
The teacher part took place in individual sessions, scheduled after student session 2 was completed. After providing general information, teachers read the students' instructions about the reading tasks and test, including example test questions (Fig. 3). Teachers read the three texts and judged students' text comprehension. After having made judgments for each student and each of the three texts, teachers gave restudy rankings, indicating in what order each student should restudy the texts, and indicated how they thought the students judged their own test score.³ Teachers were asked to think out loud while making judgments. In all conditions, teachers were provided with information from which they could deduce cues: (1) students' completed diagrams about the texts (performance cues), (2) students' names (student cues), and (3) the task (task cues). Teachers first made all judgments in the judgment-only condition, then in the judgment + cue-list condition, and finally in the judgment + cue-list + cue-value-judgment condition to prevent carry-over effects. Within each condition, teachers always started with a 'practice student' to get familiar with the procedure of the condition. In the judgment-only condition, they practiced the procedure with the practice student for all three texts. That is, they made judgments about students' text comprehension for each of the three texts and then they made a restudy decision. In the judgment + cue-list condition and the judgment + cue-list + cue-value-judgment condition, teachers practiced the procedure with a practice student, but only for one text because of time constraints. Because they only practiced with one text, they did not make restudy decisions for the practice student in these two conditions.
In the judgment + cue-list condition, teachers, in addition to making judgments, also indicated which cues they had used for their monitoring judgments (see Fig. 3). In the judgment + cue-list + cue-value-judgment condition, teachers, in addition to making judgments and indicating cue-use, also judged the values of the used cues. Data of the practice students was not included in the analyses.

Monitoring accuracy
We used bias and absolute accuracy as indices of teachers' monitoring accuracy. Bias was calculated by subtracting a student's test score from a teacher's judgment. Scores range from -8 to +8; scores closer to zero indicate more accurate judgments, negative scores indicate underestimation, and positive scores indicate overestimation. Absolute accuracy is the absolute difference between teachers' judgments and students' test scores. Scores range from 0 to 8; scores closer to zero indicate more accurate judgments.
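The two indices can be written out as a minimal sketch. The function names and the judgment/test-score pairs are invented for illustration; only the two formulas come from the definitions above.

```python
def bias(judgment, test_score):
    """Signed difference: positive = overestimation, negative = underestimation."""
    return judgment - test_score


def absolute_accuracy(judgment, test_score):
    """Unsigned difference: 0 = perfectly accurate, larger = less accurate."""
    return abs(judgment - test_score)


# Invented (teacher judgment, student test score) pairs on the 0-8 scale.
pairs = [(6, 4), (3, 5), (8, 8)]

mean_bias = sum(bias(j, s) for j, s in pairs) / len(pairs)
mean_absolute = sum(absolute_accuracy(j, s) for j, s in pairs) / len(pairs)
print(mean_bias, mean_absolute)
```

In this invented example, over- and underestimation cancel out in the mean bias while the mean absolute accuracy remains above zero, which illustrates why the two indices are reported side by side.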

Cue-diagnosticity
To measure cue-diagnosticity, we calculated correlations between actual cue values (cf. Table 1 for instruments) and students' test scores. For those cues for which a negative correlation meant high diagnosticity (i.e., omissions and commissions, number of difficult words in a text, learning problem, text position and length), we used the unsigned correlation. A cue was considered highly diagnostic when cue values correlated highly with students' test scores.
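As an illustration of this measure, the sketch below computes a Pearson correlation between cue values and test scores and takes the unsigned value for a negatively keyed cue. The use of Pearson's r, the function names, and the data are assumptions for illustration; the type of correlation coefficient is not specified above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def diagnosticity(cue_values, test_scores, unsigned=False):
    """Correlation between actual cue values and test scores.

    Set unsigned=True for cues where a negative relation signals diagnosticity
    (e.g., omissions), so that cues can be compared on one scale.
    """
    r = pearson(cue_values, test_scores)
    return abs(r) if unsigned else r


# Invented data for five students.
test_scores = [1, 2, 4, 6, 8]
correct_relations = [0, 1, 2, 3, 4]
omissions = [4, 3, 3, 1, 0]

print(round(diagnosticity(correct_relations, test_scores), 2))
print(round(diagnosticity(omissions, test_scores, unsigned=True), 2))
```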

Used-cue value judgment accuracy
In the judgment + cue-list + cue-value-judgment condition, teachers were also asked to judge the values of the cues used. We calculated the cue-value judgment accuracy in terms of bias; for this we subtracted the actual cue value from the judged cue value. So if a teacher, for example, judged that a student had two correct facts in their diagram, whereas this student had, in reality, five correct facts, the bias score was 2 - 5 = -3, meaning that the teacher underestimated the number of correct facts. We also calculated absolute judgment accuracy by calculating the absolute difference between teachers' judgments of cue value and actual cue values. So for the aforementioned example, the absolute cue-value judgment accuracy would be three (|2 - 5| = 3). Because scales differed per cue (Table 1), the range of used-cue value judgment accuracy varied per cue. In our analyses, we used z-scores.
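The worked example above, together with one possible way of standardizing, can be sketched as follows. Standardizing the accuracy scores within each cue and using the population standard deviation are assumptions made for this sketch; the text only states that z-scores were used.

```python
from statistics import mean, pstdev


def cue_bias(judged, actual):
    """Judged minus actual cue value: negative = underestimation of the cue."""
    return judged - actual


def cue_absolute_accuracy(judged, actual):
    """Absolute difference between judged and actual cue value."""
    return abs(judged - actual)


def z_scores(values):
    """Standardize values (assumed here: per cue, population SD)."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


# Example from the text: two correct facts judged, five actually present.
print(cue_bias(2, 5), cue_absolute_accuracy(2, 5))

# Invented judged/actual values of one cue across three judgments.
judged = [2, 4, 3]
actual = [5, 4, 1]
absolute = [cue_absolute_accuracy(j, a) for j, a in zip(judged, actual)]
print(z_scores(absolute))
```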

Analyses
For RQ1 (cue-diagnosticity), we provide correlations between actual cue values and students' test scores. Regarding RQ2 (cue-utilization), we provide descriptives and occurrences of cue(s) used for single judgments (i.e., cue-use patterns). We restricted the description to those cue-use patterns that were used in ≥10% of the judgments. For RQ4 (relation between cue-utilization, used-cue value judgment accuracy, and monitoring accuracy), we used multilevel analysis (judgment (level 1), student (level 2), teacher (level 3)). Teachers only judged cues that they had used; therefore, there were many 'missing' values for the judgments of those cues that were not used. For some cues, cue judgments were missing for as many as 97.9% of the cases (e.g., student's nationality; this cue was thus seldom used). For the used-cue value judgment accuracy model, we only selected cues that had less than 60% missing values (cf. Table 3). For the cue-utilization model, we included all cues.
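The selection rule for the used-cue value judgment accuracy model (less than 60% missing values) can be sketched as below. The data layout, the cue names, and the values are hypothetical; `None` stands for a judgment that is missing because the teacher did not use that cue.

```python
def missing_share(values) -> float:
    """Proportion of judgments for which no cue-value judgment was given."""
    return sum(v is None for v in values) / len(values)


def select_cues(cue_judgments: dict, max_missing: float = 0.60) -> list:
    """Keep cues whose share of missing judgments is below the threshold."""
    return [
        cue
        for cue, values in cue_judgments.items()
        if missing_share(values) < max_missing
    ]


# Hypothetical cue-value judgments for five judgments per cue.
cue_judgments = {
    "correct_relations": [3, None, 4, 2, 5],     # 20% missing
    "nationality": [None, None, None, None, 1],  # 80% missing
}

selected = select_cues(cue_judgments)
```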

Results
Generally, teachers overestimated students' test scores by 1.15 points, and their judgments deviated, in an absolute sense, 2.19 points on average from students' actual test scores (cf. Table 2).

Cue-diagnosticity (RQ1)
As expected, performance cues were, on average, more diagnostic than student and task cues (Table 3). Task cues had the lowest diagnosticity. Yet, within cue categories, we saw substantive variation. As expected, the performance cue 'number of correct facts'⁴ was hardly diagnostic (0.08), whereas the 'number of correct relations' was highly diagnostic (0.59). Another highly diagnostic cue was the 'number of correct elements' (0.63). The cue 'omissions' was somewhat less diagnostic but still moderately to strongly correlated with students' test scores (0.45). All student cues had low diagnosticity (all < 0.30). Within this category, 'general reading comprehension level' (0.25) and 'IQ' (0.22) were relatively most diagnostic. All task cues had low diagnosticity.

Cue-utilization (RQ2)
Per judgment, teachers used on average 6.35 cues (SD = 3.94) with a minimum of 1 and maximum of 24 (out of 28) cues. On average, they used 3.19 diagram cues (SD = 1.77), 2.25 student cues (SD = 2.30), and 0.92 task cues (SD = 1.33) per judgment. In many of their judgments, teachers used cues that were highly diagnostic. For example, they used correct elements, relations, and omissions in over 50% and students' general reading comprehension level and IQ in over one third of their judgments. Yet, two low diagnostic cues (i.e., correct facts and students' effort) were also used relatively often: in about two thirds and one third of the judgments, respectively. Differences in cue-utilization were small between conditions (see Supplemental material).
For the total of 813 judgments, we encountered 456 unique cue-use patterns, occurring between 1 and 28 times.⁵ The patterns occurring >10 times are reported in Table 4. The fact that the most common pattern (i.e., omissions, correct facts, elements, and relations) was used only 28 times (in 813 judgments; i.e., in 3.44%) indicates that no single cue-use pattern stood out. Seven out of eight cue-use patterns in Table 4 consisted of performance cues only and mostly included correct relations, facts, and elements (6 out of 8 patterns). Furthermore, teachers sometimes used only one cue, namely omissions.

Teachers' used-cue value judgment accuracy (RQ3)
Teachers mostly struggled with accurately judging performance cues; their judgments deviated on average around 30% from actual cue values. They mostly overestimated cue values (Table 3). Correct elements, for example, which was a highly diagnostic cue, was overestimated by about 30%. Correct facts (low diagnosticity) was overestimated by about 50%. As for student cues, teachers' used-cue value judgment accuracy differed between cues. For some student cues, judgments were remarkably accurate (e.g., conscientiousness: 0.29%; grades other subject: 3.3%; student's interest: 0.33%), whereas for other student cues, judgments were quite inaccurate (e.g., students' ability to reproduce facts: overestimation of 33%; student's nationality: teachers thought the student and/or their parents were non-Dutch whereas they were [28.8%]). Teachers judged the number of facts in the text (task cue) relatively accurately (2.67% deviation) but overestimated the number of difficult words in the text by about 24%.

Cue-utilization and used-cue value judgment accuracy vs. monitoring accuracy (RQ4)
The majority of variance in teachers' monitoring accuracy of students' test scores was situated at the judgment level (bias: 73%; deviation: 81%). Smaller parts of the variance resided at the teacher level (bias: 11%; deviation: 3%) and the student level (bias: 16%; deviation: 16%). Monitoring accuracy thus mainly varied from judgment to judgment.
When teachers used omissions as a cue, their monitoring accuracy (deviation) was higher (Table 5). In contrast, using students' general reading comprehension levels, grades for other subjects, nationality, extraversion, and IQ was related to more overestimation (bias). When teachers judged the correct relations in students' diagrams and students' general effort levels in class more accurately (deviation), their monitoring was more accurate (deviation and bias; Table 5).

Discussion
We investigated teachers' monitoring accuracy of students' text comprehension. Students completed pre-structured diagrams representing causal relations in the texts they had read. While judging students' text comprehension (i.e., test performance), teachers had access to these diagrams (giving access to performance cues such as correct relations and omissions in students' diagrams), and to students' names (giving access to student cues such as IQ and gender). They had also read the texts and seen example test questions beforehand (giving access to task cues such as text length and text position). We explored how diagnostic a wide range of performance, student, and task cues were for students' text comprehension (RQ1), what patterns in teachers' cue-utilization could be observed (RQ2), and how accurately teachers could judge the values of the cues they had used (used-cue value judgment accuracy; RQ3).

4 The number of correct facts refers to elements that are not essential for the causal relations and that were thus not part of the test.
5 We restricted ourselves to cues that were used in ≥10% of the judgments (cf.

Table 3 Cue-diagnosticity, teachers' self-reported cue-utilization, actual cue values, teachers' cue judgments and teachers' used-cue value judgment accuracy per cue.
Note. Cue-diagnosticity: min = −1 meaning low diagnosticity, max = +1 meaning high diagnosticity; cue-utilization: min = 0, max = 1; used-cue value judgment accuracy: closer to 0 is more accurate. Cue-utilization is coded as 0 (not used) or 1 (used); the mean indicates the proportion of judgments for which the particular cue is used.
a Calculated as: ((cue judgment − actual cue value)/nr of scale points)*100. A positive value indicates a teacher's overestimation of the cue value and a negative value indicates underestimation. If the max for a cue was ∞, we used the maximum of the teachers' cue judgment.
b Mean percentage in absolute sense.
c To calculate teachers' used-cue value judgment accuracy for learning problems, we only considered the combination of a teacher who indicated that (s)he used this cue (score = 1) with that student actually having a learning problem (score = 1) as accurate. Cases in which the teacher did not use it and the student did not have it were not counted as accurate because not using it was the default value for this teacher variable; we did not ask the teacher to explicitly judge whether or not the student had each learning problem, we only asked whether they used it.

Cue-diagnosticity (RQ1)
Monitoring accuracy is considered to depend on how diagnostic used cues are, that is, how predictive they are of test performance (Koriat, 1997). However, cue-diagnosticity is often not measured. By measuring actual cue values, we could determine cue-diagnosticity. Overall, performance cues were most diagnostic, then student cues, followed by task cues. As expected, the number of correct relations in students' diagrams was highly diagnostic of students' test scores (cf. Author, 2014). Correct elements and omissions in students' diagrams were moderately to highly diagnostic. Importantly, not all performance cues were diagnostic; as expected, correct facts in students' diagrams, which was used in many teachers' judgments, had low diagnosticity, as did commissions in students' diagrams. All student and task cues had low diagnosticity. These findings substantiate the widely held assumption that performance cues are highly diagnostic, and more diagnostic than student and task cues. However, the variability in the diagnosticity of performance cues shows that caution is needed when designing interventions to improve teachers' monitoring accuracy. Only the use of certain performance cues (here: relations, elements, and omissions) should be promoted, based on their actual diagnosticity for the to-be-judged task.

Cue-utilization (RQ2)
To gain more insight into the judgment process, we investigated the number, type, and patterns of cues used. The number of cues used and the extent to which each cue type was used were similar to the findings of Author (2020): on average, 6.35 cues were used per judgment, and teachers mostly used performance cues, then student cues, and then task cues. The cues with the highest diagnosticity (correct elements, relations, and omissions) were used in the majority of judgments. Yet, teachers also used performance and student cues with low diagnosticity (i.e., facts in students' diagrams, students' effort in class, grades for the teacher's subject, general reading comprehension level, IQ) to a considerable extent, even though they were made aware that they had to judge students' test scores and that the test was about text elements and relations. We found as many as 456 unique cue patterns across a total of 813 judgments, and no single pattern stood out for being used often. However, the most frequently used patterns only or mainly contained performance cues.
These findings show that teachers draw upon a considerable amount of information when making judgments, including non-diagnostic information. Future research could investigate whether teachers' monitoring accuracy would improve if they were encouraged to limit the number of cues they use and to focus on diagnostic performance cues.

Table 5 Model results for multilevel models of teachers' judgment accuracy of students' test scores predicted by cue-utilization and used-cue value judgment accuracy (unstandardized coefficients).
Note. For the used-cue value judgment accuracy model, we only selected cues that had less than 60% missing values.
a Dev/dev = used-cue value judgment accuracy deviation score (IV) and judgment accuracy of students' text comprehension deviation score (DV).
b Bias/bias = used-cue value judgment accuracy bias score (IV) and judgment accuracy of students' text comprehension bias score (DV).
c Dev/bias = used-cue value judgment accuracy deviation score (IV) and judgment accuracy of students' text comprehension bias score (DV).
d Bias/dev = used-cue value judgment accuracy bias score (IV) and judgment accuracy of students' text comprehension deviation score (DV).

Used-cue value judgment accuracy (RQ3)
For accurate monitoring, focusing on diagnostic cues and ignoring non-diagnostic cues may be a necessary but not sufficient condition: Teachers should also accurately judge the value of the cues used (e.g., judge how many relations students completed correctly in their diagram). Teachers' judgments of performance cues, which had the highest diagnosticity, appeared to be least accurate; teachers, on average, overestimated these cue values by 30%. Two highly diagnostic cues (correct relations and elements) were overestimated by 31% and 27%, respectively. This overestimation is in line with what we generally see in the literature about teachers' judgments of students' achievement (Südkamp et al., 2012; Urhahne & Wijnia, 2021). A possible explanation for this may be that teachers did not use the same standards as we did in deciding whether relations or elements were correct. Yet, the correct answers were rather straightforward as the texts contained the correct elements and relations and the teachers knew the texts. Perhaps teachers suffered from the leniency effect as suggested by Urhahne and Wijnia (2021). That is, teachers may "not take sufficient account of factors such as students' forgetting of subject matter, limited testing time, lack of effort, excitement, and test anxiety (Hosenfeld et al., 2002)." (Urhahne & Wijnia, 2021, p. 6). Therefore, even when particular cues are easy to judge, other factors may still distort teachers' judgments. In addition to not taking into account particular factors, teachers may also have taken non-diagnostic student cues into account when judging students' diagrams, which may also have hampered their cue judgment accuracy.
Merely using highly diagnostic cues was insufficient for accurate monitoring; there was no effect of using either of the two most diagnostic cues on teachers' monitoring accuracy. Yet, when teachers judged one of these most diagnostic cues (i.e., correct relations in students' diagrams) more accurately when using it, their monitoring of students' text comprehension was also more accurate. It may seem self-evident that when relations in students' diagrams are judged more accurately, students' test scores are also judged more accurately as the test focuses on students' understanding of relations. Yet, the relation between used-cue judgment accuracy and monitoring accuracy of students' performance has not been investigated before.
Furthermore, we found that using some of the low diagnostic cues hampered teachers' monitoring accuracy (i.e., students' general reading comprehension levels, grades for other subjects, nationality, extraversion, IQ). A similar effect was found in Author (2018) when using a problem-solving task in Mathematics: teachers' monitoring of students' mathematics achievement was less accurate when they had non-diagnostic student cues available in addition to diagnostic performance cues. Teachers in our study judged the low diagnostic cues quite accurately (exception: students' nationality). Finally, for one cue (i.e., omissions), mere usage was related to more accurate monitoring. Yet, judgment of this cue was hardly needed as it only involved counting the number of blank boxes and question marks in diagrams. Surprisingly, when teachers judged the non-diagnostic cue students' general effort in class more accurately, their monitoring was also more accurate whereas mere use of this cue did not foster monitoring accuracy. For those effort judgments that were very accurate (absolute deviation < 0.30), the mean level of students' effort was somewhat lower (2.6) than for those effort judgments that were more inaccurate (absolute deviation >1) in which case the mean was 3.3. Perhaps, when monitoring effort more accurately and when student effort was relatively low, teachers may have lowered their judgments of students' test scores based on the somewhat lower effort level. Given that teachers generally overestimated students' test scores, lowering their judgments may have resulted in more accurate judgments of students' test scores. Yet, future research should further investigate this tentative explanation.

Limitations and future research
One limitation is that we measured cue-diagnosticity by calculating overall correlations between actual cue values and students' test scores. This group-level diagnosticity is useful when, for example, designing interventions. Nevertheless, it may be that a particular cue is somewhat more diagnostic for one student than for another student.
Furthermore, the instruments for measuring actual cue values of students' IQ, ability to reproduce facts, and general reading comprehension level did not perform sufficiently on one of the three quality indicators (i.e., internal consistency). We therefore need to interpret these results with caution. The low internal consistency may make it harder for teachers to judge these cues given that answers on items within cues are not necessarily consistent. Nevertheless, teachers judged the actual cue values of students' IQ and general reading comprehension very accurately (deviation 3–5%). Future research could investigate whether teachers' judgments of these cues would be similarly accurate when using instruments with higher internal consistency.
In addition, differences between texts regarding, for instance, length and difficulty were small. This may have caused low diagnosticity and may have prevented teachers, if they were aware of this, from making (more) use of these cues. Future research could further investigate the diagnosticity and cue-utilization of task cues when there is more variation in task characteristics. Moreover, although findings from RQ4 are highly relevant, our data do not show whether the beneficial effect of accurately judging diagnostic cues occurred because teachers only used diagnostic cues, judged these cues accurately, and ignored non-diagnostic cues, or whether they also used non-diagnostic cues but using these did not hamper their monitoring accuracy when using and accurately judging diagnostic cues. Future research could investigate this issue further.
Finally, we focused on teachers' monitoring of students' text comprehension. In other domains and with other tasks, effects of teachers' cue-utilization and used-cue judgment accuracy on their monitoring accuracy could be different. Yet, a previous study found that when monitoring problem-solving tasks in Mathematics, teachers were most accurate when they only had diagnostic performance cues available (using anonymized student work) compared to having only student cues or both performance and student cues (Author, 2018). Thus, similar to our findings, using non-diagnostic student cues seems to hamper teachers' monitoring accuracy in other domains and with other tasks as well, such as Mathematics.

Conclusion
The current study addresses teachers' monitoring of students' text comprehension when learning from texts describing causal relations, which is relevant for most subjects in secondary education. Prior research has shown that making available information that is diagnostic of students' text comprehension may be insufficient to improve teachers' monitoring accuracy. Our findings show that teachers also need to ignore non-diagnostic cues. Importantly, this study shows that deducing diagnostic cues from available information is a necessary but not sufficient condition for higher monitoring accuracy. Rather, teachers also need to judge cue values accurately if they are to accurately monitor students' text comprehension. Thus, although it has hardly received attention in the literature, teachers' used-cue value judgment accuracy seems to form an indispensable part of the monitoring process. If future research shows this finding to be robust, it could add significantly to theoretical and/or process models of teacher monitoring such as the cue-utilization model.
Our findings also have relevance for designing interventions to improve teachers' monitoring accuracy. For instance, it may be useful to raise teachers' awareness of which cues are diagnostic (and should be used) and which are not (and should be ignored) and to help teachers in accurately monitoring the most diagnostic cues either by themselves or with the aid of technology such as learning analytics.

Author note
Correspondence concerning this manuscript should be addressed to Janneke van de Pol (j.e.vandepol@uu.nl), Utrecht University, Department of Education, PO Box 80.140, 3508 TC Utrecht, The Netherlands, +31 302531796. Preliminary results of this study have been presented at the biennial conference of the European Association for Learning and Instruction, Aachen, Germany, 2019.