Measuring teacher noticing: A scoping review of standardized instruments

• Scoping review of 22 standardized test instruments that measure teacher noticing.
• Instruments are predominantly video-based and include operationalizations of different mental processes.
• Few instruments assess subject-specific noticing outside of mathematics teaching.
• Test quality varies considerably with no indication of internal consistency for some instruments.
• Validation by means of teacher knowledge, observed instructional quality and expert-novice comparisons is rarely conducted.

During instruction, teachers are simultaneously confronted with large amounts of information from which they must identify relevant instructional events, reflect on them, and determine appropriate responses. This process is often referred to as teacher noticing, which is broadly defined as "specialized ways in which teachers observe and make sense of classroom events and instructional details" (Choy & Dindyal, 2020).
Teacher noticing is considered a central component of teachers' professional competence (Scheiner, 2016; Sherin, Jacobs, & Philipp, 2011a; Stahnke, Schueler, & Roesken-Winter, 2016), and considerable efforts have been made in recent years to develop various test instruments to measure noticing (e.g., *Seidel & Stürmer, 2014). With increasing recognition of the importance and complexity of teacher noticing, further instruments are still being developed.
The development of instruments to assess teacher noticing has posed significant challenges owing to noticing's volatile nature as an "in-the-moment-practice" (Jacobs, 2017, p. 273). Common approaches use classroom artifacts: for example, a teacher might watch a video clip of children discussing a mathematical problem and may then be asked to identify relevant utterances, interpret the children's mathematical understanding, and infer what the most appropriate response to the children would be (*Jacobs, Lamb, & Philipp, 2010). However, conceptualizations and operationalizations of noticing are heterogeneous across the various existing instruments.
As has been pointed out in the systematic literature review by König et al. (2022), most empirical studies on teacher noticing deploy qualitative approaches that provide detailed accounts of the nature and development of teachers' noticing. Standardized measurement of noticing, by contrast, is an important addition, as it enables the study of noticing in large samples of teachers and provides the basis of hypothesis testing. In this review, we therefore focus on standardized measurement approaches to noticing. These approaches allow researchers to empirically test theoretical assumptions, such as the conceptualization of noticing as a learning outcome of teacher education or as a correlate of professional knowledge. As the quality of research is limited by the quality of measures implemented (e.g., DeVellis, 2017), high-quality testing of noticing is needed to draw valid conclusions about the underlying theory.
Despite the recent publication of several systematic literature reviews on teacher noticing (Amador, Bragelman, & Castro Superfine, 2021; König et al., 2022; Santagata et al., 2021; Stahnke et al., 2016), no literature review to date has focused on standardized testing. An overview of existing noticing instruments will be useful, providing researchers with the necessary information to select instruments that are appropriate for their research goals, develop adequate measurement approaches and validation strategies, and identify areas for further test development. We conducted a scoping review, which is a suitable approach for this purpose (Munn et al., 2018; Noordink, Verharen, Schalk, van Eck, & van Regenmortel, 2021), to map existing instruments according to three main focal points. First, we describe the different conceptualizations of noticing that underlie the test instruments. Second, we focus on the test design, including the actual operationalization of teacher noticing. Third, we provide an overview of how researchers examined the quality of their instruments. Overall, this scoping review aims to identify research gaps and provide recommendations for future research.

Central conceptualizations of teacher noticing
The discourse on teacher noticing is characterized by various ways of describing and conceptualizing noticing, with teachers' perception as a core element (see Dindyal, Schack, Choy, & Sherin, 2021). In particular, terminological inconsistencies exist: for example, some researchers write about "teacher noticing," while others prefer the term "professional vision." Often, it remains unclear whether these two terms denote different constructs or represent similar concepts. In the section that follows, we aim to clarify this terminological issue by structuring the discourse according to four theoretical perspectives on teacher noticing, which we proposed in two systematic reviews: a socio-cultural perspective, a cognitive-psychological perspective, an expertise-related perspective, and a discipline-specific perspective (König et al., 2022; Santagata et al., 2021).
The concept of "professional vision" is central to the socio-cultural perspective, originating from the work of Goodwin (1994), who developed this concept with a focus on lawyers and archeologists. Goodwin (1994) described how professional vision (that is, a specialized way of seeing and understanding meaningful events in a professional context) is developed and shaped by social interaction in professional communities. Goodwin (1994) argued that professional vision as "the ability to see a meaningful event is not a transparent, psychological process, but instead a socially situated activity" (p. 606). Emphasizing the role of social interaction provides an important perspective on the acquisition of professional competence. However, owing to the focus on social interaction instead of the individual mind, this approach has been taken up only indirectly for the standardized testing of noticing. Goodwin's (1994) general concept of professional vision was adapted for the teaching profession by Sherin and van Es (Sherin, 2001; Sherin, Russ, Sherin, & Colestock, 2008; Sherin & van Es, 2009), who focused on how participation in interactive video clubs shapes teachers' perception and sense-making of classroom interaction. This adaptation of professional vision for the teaching profession, however, also prompted a shift in perspective: while ideas of socio-cultural embeddedness, which were immanent in the work of Goodwin (1994), were less emphasized, Sherin and van Es (with others) maintained a stronger focus on the mental processes in which teachers engage during instruction. The conceptualization of teacher noticing as a set of interrelated mental processes was characterized by König et al. (2022) as a cognitive-psychological perspective on noticing.
The shift in perspective further entailed a shift in research methodology: while professional vision, as described by Goodwin (1994), lends itself to ethnographic or qualitative approaches, a focus on the individual teacher's cognitive processes may be regarded as a reference point for the standardized testing of noticing.
In their earlier work, Sherin and van Es (with others) referred to the "professional vision" construct, including the sub-processes of selective attention (also referred to as "noticing") and knowledge-based reasoning (also referred to as "interpreting") (Sherin, 2001; Sherin et al., 2008; Sherin & van Es, 2009). However, in their more recent work, they use "teacher noticing" to denote the overall construct, including subdimensions such as "attending to particular events" and "making sense of particular events" (Sherin, Jacobs, & Philipp, 2011, p. 5). Studies that were particularly important for measurement purposes foregrounded the term "professional vision" (e.g., *Seidel & Stürmer, 2014). Drawing on the work of Sherin and van Es, Seidel and Stürmer (2014) developed a standardized test instrument to assess teachers' professional vision, the Observer Research Tool, which has been influential for subsequent research. This has led to two consequences: first, professional vision and noticing are both commonly used to denote comparable sets of teachers' mental processes during instruction, thus suggesting interchangeability (see Huang, Miller, Cortina, & Richter, 2021). Second, the notion of "professional vision" has become increasingly independent of the socio-cultural perspective developed and elaborated by Goodwin (1994). Since the term "teacher noticing" is more commonly used in international research, in the following, we use "noticing" as a generic term and speak of "professional vision" only in reference to instruments for which this term is explicitly used by the authors who developed those instruments.
Aside from the two perspectives outlined so far, the roots of teacher noticing can also be found in teacher expertise research. This expertise-related perspective on noticing draws on the work of Berliner (1988), who studied the development of teaching skills from novice to expert. Although the term "noticing" is not used in this framework, studies on teacher expertise "can be regarded as precursors" (Lachner, Jarodzka, & Nückles, 2016, p. 198) because the concepts on which these studies have focused (e.g., interpreting and predicting classroom events) share similarities with the mental processes advocated by the cognitive-psychological perspective. For example, Sabers, Cushing, and Berliner (1991) demonstrated that expert teachers outperform novices with respect to their perception, monitoring, and interpretation of classroom events. Regarding research methodology, the expertise-related perspective emphasizes inter-individual differences in teachers' noticing skills, which is a precondition for the development of standardized measures.
Finally, research on teacher noticing was influenced by a fourth approach, characterized as a discipline-specific perspective. This approach was developed by Mason (2002), who understood noticing as a discipline in which teachers engage to enhance their sensitivity to classroom events. Teacher noticing is thus regarded as a "collection of practices designed to sensitize oneself so as to notice opportunities in the future in which to act freshly rather than automatically out of habit" (Mason, 2011, p. 61). As Mason (2021) notes, "the Discipline of Noticing […] is phenomenological in nature, being concerned with the lived experience of the practitioner" (p. 231) and, thus, does not directly relate to standardized measurement.
Although the four abovementioned theoretical perspectives share commonalities, their conceptualization of noticing differs with respect to focus and theoretical orientation, thereby implying different methodological orientations. In the context of standardized testing, which has been used in only a small proportion of studies on teacher noticing (König et al., 2022), the cognitive-psychological perspective is particularly relevant, since mental processes provide reference points for the operationalization of noticing. However, researchers have yet to reach a consensus on how many and which mental processes constitute noticing. On the one hand, van Es and Sherin (2002) distinguished "(a) identifying what is important or noteworthy about a classroom situation; (b) making connections between the specifics of classroom interactions and the broader principles of teaching and learning they represent; and (c) using what one knows about the context to reason about classroom interactions" (p. 573). On the other hand, *Jacobs et al. (2010) focused on teachers' professional noticing with respect to children's mathematical thinking and developed a model that included three sub-processes: attending to children's strategies, interpreting children's understanding based on the observed strategies, and deciding how to respond. *Kaiser et al. (2015) similarly conceptualized noticing as an interaction between perception, interpretation, and decision-making. This so-called PID-model is closely connected to Blömeke, Gustafsson, and Shavelson's (2015) conceptualization of competence as a continuum. In this model, the situation-specific skills (perception, interpretation, and decision-making) are conceptualized as mediator variables between cognitive and affective dispositions on the one hand and teaching performance on the other hand.
By contrast, *Seidel and Stürmer (2014) considered "professional vision" to include two components: "noticing" and "reasoning." Noticing, as *Seidel and Stürmer (2014) understood it, denotes teachers' attention to relevant instructional events, taking into account goal clarity, teacher support, and learning climate. It is thus similar to the attending component of noticing in the conceptualizations described above. Reasoning, the second component of professional vision as conceptualized by *Seidel and Stürmer (2014), denotes teachers' interpretation of instructional events based on their professional knowledge. It is divided into three distinct but interrelated processes: (1) describing relevant instructional events based on professional knowledge; (2) explaining instructional events, including the connections between different events of the teaching-learning process; and (3) predicting the impact of instructional events on teaching and learning processes (Seidel, Blomberg, & Stürmer, 2010; *Seidel & Stürmer, 2014). To conclude, the terms "noticing" and "professional vision" must be considered when examining instruments. Furthermore, it is important to consider which mental processes are differentiated and operationalized by different instrument developers.

Measuring teacher noticing
Teacher noticing primarily refers to what teachers "notice" during instruction ("in-the-moment-noticing," Sherin, Russ, & Colestock, 2011) and how they deal with what they have noticed, which is comparable to Schön's (1983) concept of "reflection in action." Since instructional practice is highly complex and testing requires standardization of the testing situation, researchers may struggle to create representations of practice, for example, using video clips, transcripts of instructional practice, or student written work samples (e.g., Dreher & Kuntze, 2015a; *Jacobs et al., 2010).
Existing instruments may be characterized along several dimensions, such as the underlying theoretical framework, which includes the conceptualization of noticing (i.e., which mental processes were characterized) and the domain-specific focus (i.e., what aspects of teaching and learning are "noticed"), and the test design, including the stimulus material used (e.g., video clips of teaching practice) and the test items (e.g., open-ended or closed-ended questions). Regarding the underlying theoretical framework, researchers are first challenged to choose or develop an accurate conceptualization of noticing that includes how many and which mental processes, called noticing facets (e.g., perception, interpretation, and decision-making), are measured and how these facets can be operationalized (e.g., *Kaiser et al., 2015). Furthermore, noticing is commonly measured with a domain-specific focus representing the aspects of teaching and learning in which teachers engage while taking the test. The focus may relate to subject matter content (e.g., *Steffensky, Gold, Holodynski, & Möller, 2015) or to general pedagogical aspects (e.g., *Seidel & Stürmer, 2014) or both (e.g., *Blömeke et al., 2015).
The test design refers specifically to the construction of tasks or items used to measure a construct of interest and their combination into a test instrument, as well as to scoring procedures and test administration (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Measures used for teacher noticing commonly employ stimulus material consisting of artifacts of instructional practice (mostly video clips) in combination with writing prompts or closed-ended questions (*Jacobs et al., 2010; *Kaiser et al., 2015; *Seidel & Stürmer, 2014; *Steffensky et al., 2015). Underlying these approaches is the implicit assumption that teachers who watch videos of instructional practice engage in cognitive processes comparable to those that they encounter during their own instruction. Moreover, the development of an accurate scoring system (that is, defining correct and incorrect answers) poses particular challenges when measuring noticing, since scholars must define what constitutes "correct" or "incorrect." Some studies have also applied specific technologies to examine teacher noticing during instruction, for example, small wearable cameras combined with subsequent recall interviews (Sherin, Russ, & Colestock, 2011). Eye-tracking was used to investigate teachers' gaze behavior while watching videos of instructional practice (see Grub, Biermann, & Brünken, 2020). Kosko, Heisler, and Gandolfi (2022) studied pre-service teachers' head movements while the teachers viewed a 360-degree video using a virtual reality headset. However, such methods cannot easily be applied to large samples and do not fully allow for standardization of the testing situation. Therefore, these approaches are not included within the scope of our review.
To conclude, the development of standardized noticing measures, particularly those based on video material, poses significant challenges for researchers (Jacobs, 2017; *Kaiser et al., 2015; Nickerson, Lamb, & LaRochelle, 2017). Thus, the provision of an overview of the underlying conceptualizations and the test designs of existing instruments may facilitate future test development. During the test development process, both the underlying conceptualization and the test design are closely connected to the assessment of test quality.

Test quality
For this scoping review, we focused on the two classical aspects of test quality: reliability and validity, which were also emphasized in the Standards for Educational and Psychological Testing (AERA et al., 2014). Drawing on classical test theory, reliability denotes the precision of a measurement indicated by "the correlation between scores on two equivalent forms of the test" (AERA et al., 2014, p. 33). In broader terms, reliability denotes "the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported" (AERA et al., 2014, p. 33), thus referring to a range of possible coefficients (e.g., Cronbach's alpha, generalizability coefficients etc.).
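As an illustration of one of the coefficients named above, the following minimal sketch computes Cronbach's alpha from a respondent-by-item score matrix using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and the small dichotomous data set are hypothetical and are not taken from any instrument in this review.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(scores) < 2:
        raise ValueError("at least two respondents are required")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four respondents, three dichotomously scored items.
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(example_scores), 2))  # 0.75
```

With these illustrative data the item variances sum to half of the total-score variance, which gives alpha = 1.5 * 0.5 = 0.75.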
Building on reliability, "validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests" (AERA et al., 2014, p. 11). To investigate validity, it is thus necessary to clarify a test's intended interpretation and to collect evidence in support of this interpretation. Possible sources of evidence include the analysis of test content (with respect to the construct addressed), the response processes (e.g., cognitive processes while taking the test), the test's internal structure (e.g., using factor analysis), and relationships to other variables (e.g., test-criterion relationships) (AERA et al., 2014).
While AERA et al. (2014) understand validity as a "unitary construct" (p. 14), earlier conceptualizations differentiated three validity types, which are still commonly used in research: (1) content validity, requiring the test items to be an adequate sample of all possible items that measure the construct; (2) criterion-related validity (also called predictive validity), focusing on the empirical relationship between a measure and a criterion measure; and (3) construct validity, which requires the test score's correlation with other variables to be consistent with theoretical assumptions regarding the relationship between the construct measured and other constructs or measures (Cronbach & Meehl, 1955; DeVellis, 2017).
Regarding the quality of tests used to measure noticing, it is worth examining which concrete operations or strategies are used to assess reliability and validity. An overview of existing approaches can help provide a guideline for providing validity evidence concerning existing and newly developed instruments and reveal desiderata in test validation.

Research questions
The present paper aims to provide a scoping review of existing standardized test instruments used to assess teacher noticing by addressing three research questions:
1. How was the noticing construct conceptualized, including the overarching concept, the mental processes (noticing facets) distinguished, and the domain-specific focus?
2. How were test instruments for teacher noticing designed with respect to stimulus materials, test items, scaling, and scoring?
3. How was test quality examined, specifically in relation to the required quality standards of reliability and validity?
The answers to these questions will enhance our scientific knowledge of the state of the field and help identify key research gaps, that is, areas in which further research with existing instruments or even the development of new test instruments is required.

Method
To address the questions raised above, we conducted a scoping review. Addressing exploratory research questions, scoping reviews encompass the mapping of existing evidence in a topic area based on a systematic literature search, thereby identifying research gaps and allowing first insight into the field (see Colquhoun et al., 2014). Literature selection, data collection, and reporting were in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR; Tricco et al., 2018).

Selection process
We conducted a systematic literature search to identify relevant papers. We first searched using the terms "teacher* AND notice*" as well as "teacher* AND professional vision." 2 As described above, the terms "noticing" and "professional vision" are regularly used interchangeably or as closely related terms. We included both since we did not wish to exclude relevant literature for terminological reasons.
The search was conducted across five online databases (ERIC, PsycINFO, ScienceDirect, Scopus, and Web of Science) and considered the titles, abstracts, and keywords of the publications. No restrictions were placed on publication year or publication type. This procedure resulted in 7205 publications in June 2019 following the removal of duplicates.
To screen publication titles and abstracts, we applied the following three inclusion criteria: (1) publication in a peer-reviewed journal; (2) publication in English; and (3) explicit focus on teacher noticing in the publication. Articles not published in peer-reviewed journals (n = 2831) were excluded to ensure that only high-quality publications were considered; publications in languages other than English (n = 962) were excluded to ensure a high level of accessibility; and publications that did not focus on teacher noticing (n = 3186) were excluded to ensure that only publications relevant to our purposes were selected. This screening yielded a total of 226 peer-reviewed English-language journal articles focused on teacher noticing. Full-text versions were then retrieved and reviewed by the authors, and the publications' relevance was assessed. Publications had to meet the two following inclusion criteria: (1) relevance to the discourse on teacher noticing, and (2) use of standardized testing to assess teacher noticing. Publications in which teacher noticing was not a construct or phenomenon of interest in the full-text version (n = 44) and publications that did not use standardized tests to measure teacher noticing (n = 145) were excluded. AERA et al. (2014) defined a test as "a device or procedure in which a sample of an examinee's behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process" (p. 2). Thus, a measure was considered to be standardized testing if the score indicating the participants' noticing capability was computed based on a standardized procedure. This yielded 37 publications that were ultimately included in this review. Fig. 1 summarizes the selection process.
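The counts reported for the selection process can be checked with a few lines of arithmetic. The sketch below simply subtracts the reported exclusions at each stage; the variable and function names are illustrative, and it assumes that each excluded publication is counted under exactly one exclusion reason, which is consistent with the totals given in the text.

```python
# Counts as reported in the selection process (after removal of duplicates).
IDENTIFIED = 7205

SCREENING_EXCLUSIONS = {
    "not published in a peer-reviewed journal": 2831,
    "not published in English": 962,
    "no focus on teacher noticing": 3186,
}

FULL_TEXT_EXCLUSIONS = {
    "teacher noticing not a construct of interest": 44,
    "no standardized test of teacher noticing": 145,
}


def remaining(start, exclusions):
    """Return the number of records left after subtracting all exclusions."""
    left = start - sum(exclusions.values())
    if left < 0:
        raise ValueError("exclusions exceed the number of available records")
    return left


full_text_assessed = remaining(IDENTIFIED, SCREENING_EXCLUSIONS)
included = remaining(full_text_assessed, FULL_TEXT_EXCLUSIONS)

print(full_text_assessed)  # 226
print(included)            # 37
```

The two results match the 226 full-text articles and the 37 included publications reported in the text.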

Data charting and synthesis of results
The first coding phase included the entire sample of articles on teacher noticing (n = 182), which were screened for relevance in the course of evaluating the 226 full-text versions and excluding 44 articles in which teacher noticing was not a construct or phenomenon of interest (see Fig. 1). This sample was used for a broader literature review on conceptualizations of noticing and research methods used to study it (König et al., 2022). For this purpose, a coding scheme focusing on conceptualizations and research methods was developed by reviewing a subsample of 20 articles. A first version of the coding scheme was applied to an additional 20 articles and revised as necessary. All 182 articles were coded according to this final version, including double coding for 20%. Coding was conceptualized as dichotomous, meaning that the coder had to determine whether the article included a particular item of information. Interrater reliability can be described as good. The coding team discussed unclear coding decisions for all articles. With respect to the present review focusing on test instruments, this first coding scheme was used to retrieve information concerning the study design (e.g., cross-sectional, pre-post), the sample surveyed (e.g., in-service teachers, pre-service teachers) and to identify those articles that included standardized testing of teacher noticing.
A second phase of coding was conducted for the present review, and included articles that reported on the standardized testing of teacher noticing (n = 37). It should be noted that these 37 articles describing test instruments were only a small proportion of the whole literature selection (n = 182), indicating that the theoretical framework and methodology had to be investigated in more detail. Based on a review of these articles, a second coding scheme was developed to focus on:
- the construct being measured, including
  o overarching concept (e.g., noticing, professional vision),
  o noticing facets (e.g., perception, interpretation, and decision-making), and
  o domain-specific focus (e.g., student thinking, subject matter content);
- the test design, including
  o stimulus material (e.g., video clips),
  o item format and number (e.g., number of rating items),
  o scaling (e.g., mean scale, sum scale), and
  o scoring (e.g., scoring based on a coding manual);
- and the test quality, including
  o reliability (e.g., internal consistency, interrater agreement) and
  o validity (e.g., content validity, construct validity).
Footnote 2: The truncation symbol was added to ensure that the search results included all possible word-endings, particularly plural forms and gerunds. Searches using only the term "vision" yielded numerous references that were irrelevant to this review. We therefore used the complete term "professional vision."
This coding scheme included dichotomous codes as well as open categories (e.g., nomenclature of noticing facets/domain-specific focus, values of coefficients etc.). Fifteen articles were double-coded, indicating good interrater reliability (M Kappa = .84). The remaining articles were coded by one coder. For both coding phases (double and single coding), the research team discussed any discrepancies and unclear decisions. All coding categories from both coding manuals ultimately used for this review can be found in supplementary material A, including kappa values. To provide an overview of the test instruments identified in this review, charted data are presented in tables, including short presentations of each test instrument (see Table 1 and supplementary materials B and C). A synthesis of the results is provided in the text.
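To make the reported agreement statistic concrete, the following minimal sketch computes Cohen's kappa for one dichotomous coding category rated by two coders, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The function name and the ten example codes are hypothetical; they are not the review's coding data. A mean kappa such as the one reported above would be obtained by averaging such values across coding categories.

```python
def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two equally long sequences of categorical codes."""
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("both coders must rate the same non-empty set of units")
    # Observed agreement: share of units coded identically by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions per category.
    categories = set(coder_a) | set(coder_b)
    expected = sum(
        (coder_a.count(c) / n) * (coder_b.count(c) / n) for c in categories
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical dichotomous codes (1 = information present, 0 = absent)
# assigned by two coders to the same ten articles.
codes_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
codes_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohen_kappa(codes_a, codes_b), 2))  # 0.6
```

In this example the coders agree on 8 of 10 articles (p_o = .80) while chance agreement is p_e = .50, so kappa = .30 / .50 = .60.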

Basic characteristics of the articles
Of the 37 articles included in this systematic literature review, the earliest was published in 2008. Most were published by European researchers (25), while the remainder were authored by researchers from North America (10) and Asia (2), including one collaboration between Chinese and European researchers (*Yang, Kaiser, K€ onig, & Bl€ omeke, 2019). Cross-sectional designs (22), pre-post designs (15), and a longitudinal design with more than   (2013) The instrument was designed for pre-service teachers in different subjects. Using 112 rating items and six video clips (about 3.5 min) a of various subjects, participants are asked to agree/disagree with statements about observed instruction. Ratings are scored as correct if they match an expert rating.

Professional Vision
Knowledge-based reasoning Description Explanation Prediction Goal clarity Teacher support Positive learning climate 'Observer Extended'  This is a modified version of the Observer, designed to survey pre-service teachers and teacher candidates during their induction phase. The instrument includes 10 video clips (about 3 min) and 41 rating items, and covers more areas of teaching than the original version. The instrument is an equivalent version of the TEDS-FU Test (primary) focusing on secondary early-career teachers, who had participated in the study TEDS-M for secondary mathematics teachers. The video vignettes refer to central topics in school mathematics from years 8e10. The instrument includes 38 rating items and 36 open-response items.

Noticing
Perception Interpretation Decision-making General pedagogy-related aspects Mathematics instruction-related aspects 'Video Case Diagnosis task' *Dalvi and Wendell (2017) The instrument was used for pre-service primary teachers, engineering students, and teacher educators with experience in engineering and science education. It includes one video clip (6 min) of primary students working on an engineering design problem. Using four prompts, participants are asked to identify children's ideas and practices regarding science and engineering and to suggest responses to develop the children's understanding. The participants receive the video's transcript to work on these tasks. Based on an expert solution, coding rubrics were developed for scoring.

Noticing
Noticing Responding Science ideas Engineering practices

Multiple Representations Questionnaire
Dreher and Kuntze (2015a) Dreher and Kuntze (2015b) This instrument was designed for pre-service and in-service mathematics teachers. It includes four short transcripts of fictious teacher-student interaction, all focused on one specific aspect of school mathematics. For each vignette, the participants work on one writing prompt. Answers are scored as correct if the participants evaluate the teacher's response negatively and justify this evaluation by referring to a change in representations of fractional arithmetic.

Noticing
Holistic approach (theme-specific noticing) Multiple representations in mathematics classes Noticing Measure by Jacobs et al. *Jacobs et al. (2010) The instrument was designed for pre-service and in-service primary teachers as well as emerging teacher educators with different experience in children's mathematical thinking. Two assessments are combined: one is based on a video clip (9 min), the other contains three samples of students' written work. Both assessments refer to primary school mathematics classrooms and include three writing prompts: participants are asked to (1) describe, (2) interpret, and (3) (2017) The instrument was designed for pre-service mathematics teachers and includes three 12th grade student written work samples with each representing a different approach to a mathematics problem (algebra and function). Participants work on three writing prompts (describing,  (2019) The participating pre-service teachers engage in online video-based learning environments that focus on professional noticing. Based on these learning environments, participants then work on performance-based tasks during their own instruction. Work on the performance-based tasks is evaluated by the researchers using Likert scales.

Noticing
Holistic facet Students' mathematical thinking Monitoring Competence Assessment Tool *Kaendler, Wiedmann, Leuders, Rummel, and Spada (2016) *Wiedmann, Kaendler, Leuders, Spada, and Rummel (2019) The instrument was designed for pre-service teachers and teacher candidates during their induction phase. It includes three short, scripted videos (about 1 min) showing groups of three students aged around 13 years. The videos depict students solving mathematics problems using collaborative, cognitive, and metacognitive activities. Participants rate 32 dichotomous items in terms of whether descriptive statements on students' activities are true. The answers are correct if they match an expert rating.

Professional vision
Describing meaningful classroom events Student interaction in collaborative learning settings Comparative Judgment Instrument (primary) *Keppens et al. (2019) The instrument was designed for pre-service primary teachers and contains 15 video clips (around 2 min) showing (inclusive) primary classrooms. Noticing is measured by comparative judgments: videos are presented pairwise and participants judge which of the two videos is better regarding two aspects of inclusive teaching (20 judgments). The participants score higher if their judgments deviate less from an expert rank order. Reasoning is measured using Likert scales; participants rate how important certain arguments were for their judgments (33 rating items).  (2016) The instrument was designed for pre-service teachers (mathematics/science/informatics) and includes one video clip (3.5 min) showing mathematics instruction with a focus on adaptive strategies. The participants are asked to describe and evaluate the adaptive instruction and to create alternatives by means of three open-ended tasks. To measure 'selective attention', coders assess the number of perceived events; further aspects of knowledge-based reasoning are rated using a coding system.

Noticing
Selective attention Knowledge-based reasoning Reasoning process Explanation/use of concepts Dealing with negative events Dealing with positive events Adaptive teaching Assessment Scheme of Professional Vision of Self-Regulated Learning ('SRL-PV assessment scheme') *Michalsky (2014) The instrument was designed for pre-service mathematics teachers and includes one video clip (25 min) of a high school mathematics lesson. The participants are prompted to specify the time stamp in the lesson when they notice that the teacher teaches self-regulated learning. They are further asked to describe and explain this situation and to predict how these instructional events will develop self-regulated learning. The participants' utterances are coded into four levels of professional vision depending on which processes are identifiable in the utterances.

Professional vision
Noticing Knowledge-based reasoning Describing Explaining Predicting Self-regulated learning Direct delivery mode Indirect delivery mode Analyzing Teacher Moves Test *Scherrer and Stein (2013) The instrument was designed for in-service mathematics teachers and includes one video (2.5 min) of secondary mathematics classroom discussion. Participants receive the transcript and ten (mostly open-ended) questions focusing on what they paid attention to, what they appreciated, and what alternative strategies they would propose. The answers are scored as correct if attention was on teacher-student interaction (unit of noticing), if specific codings were used (language use), and if opportunities to learn were related to teacher-student interaction. The number of possible points is not restricted.  (2008) The instrument was used for pre-service mathematics teachers and includes one video of a whole mathematics lesson (45 min). After watching the video, the participants work on 61 items of several formats. Items refer to clearly observable facts and thus do not require any interpretation (e.g., participants are asked to list as many names of students from the video as they remember).

Noticing
Attending ( *Weber, Gold, Prilop, and Kleinknecht (2018) The instrument was designed for pre-service and in-service primary teachers. It includes four video clips (about 3 min) that depict extracts of primary science lessons. Using 47 rating items, the participants disagree/agree with statements referring to classroom management in the observed instruction. The ratings are scored as correct if they match an expert rating. The instrument was used for pre-service and in-service primary teachers. It includes six video clips (about 3.5 min) which mainly show teacher-class interaction during primary science lessons. Similar to the PVCM Test, participants work on 68 rating items focusing on instructional support. The ratings are scored as correct if they match an expert rating.

Professional vision
Noticing Interpretation Instructional support (in science classes) Structuring

two measurements (*Stürmer, Seidel, & Holzberger, 2016) were reported. 3 The samples studied included pre-service teachers in 28 articles and in-service teachers in 15 articles, with seven articles including both pre-service and in-service teachers.

Identification of test instruments
The 37 papers included a total of 22 different test instruments, some of which were used in multiple papers. These test instruments are outlined in Table 1. 4 Some of the identified test instruments relate to one another: this concerns the instrument "Observer" (Blomberg, Stürmer, & Seidel, 2011), for which an extended version ("Observer Extended") was reported by *Stürmer and Seidel (2015). Two instruments were derived from the project "Video-based lesson analysis: Early science," referred to as "Professional Vision of Classroom Management Test" (PVCM Test, *Gold & Holodynski, 2017) and "Professional Vision of Instructional Support Test" (PVIS Test, *Todorova, Sunder, Steffensky, & Möller, 2017). Two further instruments come from the project "Teacher Education and Development Study - Follow Up" (TEDS-FU) and are called "TEDS-FU Video Tests" (primary and secondary) (*Kaiser et al., 2015). Two instruments are from the project "Potential: Power to teach all" and labeled "Comparative Judgment Instruments" (*Keppens, Consuegra, Goossens, Maeyer, & Vanderlinde, 2019; *Roose, Goossens, Vanderlinde, Vantieghem, & van Avermaet, 2018). In addition, *Jacobs et al.'s (2010) data collection approach, which consists of three open-ended questions (describing, interpreting, and deciding how to respond), has been adopted by other researchers using different stimulus materials and coding procedures (*Fisher et al., 2018, 2019; *Schack et al., 2013; *Simpson & Haltiwanger, 2017). This led to similar but distinct instruments, which we labeled as "Noticing Measures" followed by the first author's name (see Table 1). Table 1 presents the overarching concepts and noticing facets distinguished for each test instrument. The overarching concept was "noticing" for 13 instruments and "professional vision" for nine instruments.

Noticing concept and noticing facets
As the final column of Table 1 indicates, considerable heterogeneity emerged with respect to which noticing facets were addressed and how these facets were named. One approach to structuring the field is to assign the facets to one of three categories: (1) perceiving/attending, (2) reasoning/interpreting, and (3) deciding/responding (see Fig. 2). Regarding the instruments based on the noticing concept, for six instruments, noticing was found to include all three categories. Regarding the professional vision concept, the conceptualizations commonly focus on categories (1) and (2) (seven instruments). 5 For both versions of the Observer and for the "Assessment Scheme of Professional Vision of Self-Regulated Learning" (SRL-PV) developed by *Michalsky (2014), the measurement of category (2) was further differentiated into description, explanation, and prediction. 6 Some instruments are restricted to certain categories: for example, the "Non-Interpretative Noticing Measure" (*Star & Strickland, 2008) only addresses attending. For the Observer (*Seidel & Stürmer, 2014), only knowledge-based reasoning (that is, description, explanation, and prediction) is measured, although noticing was considered a subdimension of professional vision on the theoretical level. By contrast, holistic measurement approaches, wherein noticing is measured as a single construct with no distinction of processes, were used for only three instruments: the "Multiple Representations Questionnaire" (Dreher & Kuntze, 2015a), the "Students' Course Outcomes" by *Johnson et al. (2019), and *Theelen et al.'s (2019) "Tagging Assessment." Table 1 includes some less common conceptualizations, such as *Kleinknecht and Gröschner's (2016) "Noticing Measure of

(2019) The instrument was designed for pre-service teachers (various subjects) and includes three video clips (about 3.5 min) showing extracts of secondary instruction. The participants are asked to 'tag' the clips, i.e., note three to five important aspects about teacher-student relationship in the video. The 'tags' are then coded with respect to the analytical level: (1) descriptive, (2) evaluation, (3) analytic, and (4) prescriptive.

Professional vision
Holistic approach Interpersonal teacher behavior 'Video Assessment of Interactions and Learning' (VAIL) *Wiens and Gromlich (2018) The instrument was originally developed for early childhood teachers but used for in-service and pre-service teachers from various domains. It includes three videos (about 2.5 min) of pre-school language arts classrooms. After each video, the participants are prompted to identify up to five teaching strategies and give specific examples from the video. The answers are scored by means of a coding manual. Each strategy-example pair is coded with respect to the identified strategies and examples, the match between strategy and example, and the breadth of identified strategies. This results in 58 possible points for the version used by *Wiens and Gromlich (2018).

Noticing
Skill Knowledge [Facets rather refer to the scoring method than to teachers' mental processes] Teaching strategies Instructional supports Classroom organization Instructional supports Notes. a The video durations given in parenthesis are the approximate duration per video included in the test. b It should be noted that the term "professional vision" does not indicate a socio-cultural perspective on noticing.
3 *Seidel and Stürmer's (2014) publication was double counted because it reported two studies. 4 When a unique name for the instrument was used in the publications, this name was adopted for the present review (indicated by single quotation marks in Table 1). For the remaining instruments, we chose names that we felt represented the most salient features of the respective instruments.

Domain-specific focuses
The test instruments typically focus on several domains (i.e., aspects of teaching and learning). The categories identified were student thinking (10 instruments), subject matter content (12), classroom management (6), and general pedagogy (13). The final column of Table 1 lists each instrument's focus. The 12 instruments relating to subject content focused on mathematics (9 instruments), science (2), and mathematics and science (1).
The domain-specific focus also depends on the subject and grade level to which the stimulus material used relates. Most test instruments (16) use stimulus material from a single subject: mathematics (12), science (3), and language (1), 7 while six instruments refer to two or more subjects. For two instruments, the subjects were not specified. In terms of school level, primary (8) and secondary (13) levels are frequently addressed, while the VAIL (*Wiens & Gromlich, 2018) is the only instrument that includes material from pre-school language arts classrooms.

Test design
In the following subsections, we focus on general trends in test and item design. Further details on the stimulus material and the items used can be found in supplementary material B and Table 1. 8

Stimulus material
The vast majority of instruments (20) use video material, while only the Multiple Representations Questionnaire (Dreher & Kuntze, 2015a) includes written vignettes, and only *Simpson and Haltiwanger's (2017) Noticing Measure includes written samples of student work. The video material used generally consists of authentic classroom practice, whereas scripted video vignettes are used only for the TEDS-FU Video Tests and the "Monitoring Competence Assessment Tool" (*Kaendler et al., 2016). The number of video clips (Min = 1, Max = 15) and the length of video clips (M = 6 min 36 s, SD = 11 min 15 s) vary considerably between instruments. The use of three to six clips of 2 to 3 min duration is the most common approach.

Item format and item design
The choice of item format is a critical aspect of test development (see *Kaiser et al., 2015). We distinguished between open-response items, dichotomous items, rating items, and comparative judgments. Most instruments (17)  To gain a deeper understanding of what teachers are asked to do during the assessment, we analyzed the sample items provided (see supplementary material C for details) in terms of item design principles. Rating items typically assess the extent to which the individual agrees with statements regarding observed instructional practice. This approach was used for the two Observer instruments (Blomberg et al., 2011), the two TEDS-FU Video Tests (*Kaiser et al., 2015), and the PVCM and PVIS Tests. Depending on the respective noticing facet, the statements are descriptive (e.g., "The teacher clarifies what the students are supposed to learn") or require an explanation (e.g., "The students have the opportunity to activate their prior knowledge of the topic") or prediction (e.g., "The students will be able to align their learning process to the learning objective") (Observer; Blomberg et al., 2011). The Monitoring Competence Assessment Tool (*Kaendler et al., 2016) adopts a similar approach, measuring the capacity to describe meaningful teaching events using dichotomous items (e.g., "Group members ask each other questions when they do not understand something," true/false). For the TEDS-FU Video Test (secondary), *König et al. (2014) further distinguished rating items assessing precise perception (e.g., "The teacher presents the lesson's task visually AND acoustically"). For the two Comparative Judgment Instruments (*Keppens et al., 2019; *Roose et al., 2018), the comparative judgment method is used to measure noticing, which is understood as an attentional sub-process of professional vision. The version that applies to primary classrooms (*Keppens et al., 2019) includes rating items to capture the reasoning process. Test participants rated how important various arguments were to their previous judgments, with higher importance corresponding to higher reasoning scores.
Open-response items prompt test participants to describe, interpret, or generate responses to aspects of the stimulus material (e.g., "Please describe in detail what you think each child did in response to this problem," *Jacobs et al., 2010). In addition, items in this format may require participants to apply their knowledge of concepts and theories to the stimulus material. For example, the VAIL (*Wiens & Gromlich, 2018) asks participants to identify five instructional strategies from the video clip and provide a specific example for each strategy. The TEDS-FU Video Test (secondary)  includes an item that asks test participants to describe the mathematical solution approaches of three pairs of students and explicitly targets the corresponding academic expressions (enactive-iconic-symbolic).
*Theelen et al.'s (2019) Tagging Assessment adopted a less typical approach: participants were asked to note three to five aspects from each video that they considered relevant to the teacher-student relationship.

Scoring and scaling
For open-ended item formats, test scores are assigned based on a coding scheme or manual (15 instruments). Expert responses are used to validate this scoring method (*Dalvi & Wendell, 2017;*Kaiser et al., 2015). For almost all instruments containing closed item formats, test scores were determined by comparing participants' ratings with those of a sample of experts. Table 1 provides brief descriptions of the scoring procedures for each test instrument.
Most instruments use scales based on classical test theory (e.g., sum or mean scales). Three instruments determine the test score with a single rating of one open-ended response. For six instruments, more sophisticated procedures based on item response theory (IRT) were used to estimate test scores.
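The expert-match scoring and the classical sum and mean scales described above can be illustrated with a minimal sketch. The function names and the example ratings are hypothetical and are not taken from any of the reviewed instruments; the sketch assumes that an item counts as correct only when the participant's rating equals the expert master rating exactly.

```python
from statistics import mean


def score_against_expert(ratings, expert_ratings):
    """Score each item 1 if the participant's rating equals the expert rating, else 0."""
    if len(ratings) != len(expert_ratings):
        raise ValueError("participant and expert ratings must cover the same items")
    return [int(r == e) for r, e in zip(ratings, expert_ratings)]


def scale_scores(item_scores):
    """Aggregate item scores into a sum scale and a mean scale."""
    return {"sum": sum(item_scores), "mean": mean(item_scores)}


# Hypothetical data: five dichotomous rating items (1 = agree, 0 = disagree).
expert = [1, 0, 1, 1, 0]
participant = [1, 1, 1, 0, 0]

item_scores = score_against_expert(participant, expert)
print(item_scores, scale_scores(item_scores))
```

Individual instruments may apply more lenient matching rules (for example, collapsing adjacent rating categories before comparison); such rules would replace the equality check in `score_against_expert`.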

Reliability
Of the reliability measures based on classical test theory, internal consistency represents almost the only measure used in the present selection, with the exception of one study that reported retest reliability (Observer; Seidel & Stürmer, 2014). Cronbach's α is reported for seven instruments and indicates high reliability on average (M = 0.85, Min = 0.64, Max = 0.98). A summary of the reliability coefficients can be found in supplementary material D, Table 1.
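For readers less familiar with internal consistency, the following sketch shows one common way to compute Cronbach's α from a participants-by-items score matrix, using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores). The function name and the example data are hypothetical; the reviewed publications do not specify their computational details, so this is an illustration under that assumption rather than a reproduction of any reported analysis.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of participant rows, each a list of item scores."""
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two participants and two items")
    k = len(scores[0])
    # Sample variance of each item (column) across participants.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of the participants' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four participants, two items.
example_scores = [[2, 3], [4, 4], [3, 5], [5, 5]]
print(round(cronbach_alpha(example_scores), 3))
```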
Reliability measures based on IRT are reported for six instruments, including weighted likelihood estimation (WLE), expected a posteriori estimation/plausible values (EAP/PV), and scale separation reliability: both TEDS-FU Video Tests (*Kaiser et al., 2015), both versions of the Observer (Blomberg et al., 2011), and both Comparative Judgment Instruments (*Keppens et al., 2019; *Roose et al., 2018). For three additional instruments, error variance due to nesting of items in the video clips was estimated using generalizability theory (Monitoring Competence Assessment Tool; *Wiedmann et al., 2019) and omega hierarchical (PVCM and PVIS Tests; *Steffensky et al., 2015). None of the above coefficients were reported for 11 instruments. A measure of interrater reliability was calculated for nine of these instruments, while no reliability measure was found in the publications for the remaining two instruments.

Validity
Regarding the traditional classification of content, construct, and criterion-related validity, a validity type was coded as explicitly addressed if authors used the technical term to describe their procedure or if clearly attributable measures were reported. 9 While content validity (14 instruments) and construct validity (8) are frequently considered, investigations of criterion-related validity are rare (3).
Content validity was primarily assessed by asking experts about the validity of the items (12 instruments) and the stimulus material (11). For example, the experts assessed whether the video material was authentic and relevant to the domain-specific focus and depicted frequent and relevant classroom events (*Schack et al., 2013; *Seidel & Stürmer, 2014). This approach goes hand in hand with the selection of appropriate video material (e.g., *Gold & Holodynski, 2017). In addition, experts rate the relevance of items and provide answers themselves, which are used to create a master rating with sufficient agreement among experts (e.g., *Kaiser et al., 2015).
Construct validity is commonly addressed by examining the internal structure of a test using factor analysis or IRT modeling (6 instruments). Group comparisons are reported as a measure of construct validity for Schack et al.'s Noticing Measure (see *Fisher et al., 2018) and the "Video Case Diagnosis task" ( *Dalvi & Wendell, 2017).
Relations to other variables with correlation or regression are reported in 11 papers (see supplementary material F). Although these papers do not explicitly target test validation, the results may be interpreted in the context of validity. Six publications relate noticing to professional (declarative) knowledge and demonstrate significant correlations between 0.25 and 0.56 (Dreher & Kuntze, 2015b; *Gold & Holodynski, 2017; *Kaiser et al., 2017; *König et al., 2014; *Meschede et al., 2017). Two publications show that noticing is related to teachers' beliefs (*Meschede et al., 2017; *Roose et al., 2019). *Blömeke et al. (2015) also examined the relationship between noticing, knowledge, and beliefs but compared teachers' profiles instead of using correlation. Using latent class analysis, they found that teachers with favorable knowledge and belief profiles attain higher noticing scores.
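The correlational evidence summarized above can be made concrete with a short sketch of the Pearson product-moment correlation between noticing scores and knowledge scores. The function name and the data are hypothetical and purely illustrative; the reviewed studies partly used latent or regression-based approaches, which this sketch does not reproduce.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x, mean_y = mean(x), mean(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cross / (spread_x * spread_y)


# Hypothetical scores for six teachers.
noticing_scores = [12, 15, 9, 20, 14, 17]
knowledge_scores = [30, 34, 28, 41, 29, 38]
print(round(pearson_r(noticing_scores, knowledge_scores), 2))
```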
Six publications relate teacher noticing to aspects of (teacher) education and professional experience and report significant effects (*Keppens et al., 2019; *Stürmer et al., 2015). However, three publications found no significant effects for teaching experience or length of internship in school (*Roose et al., 2019; *Stürmer et al., 2015; *Todorova et al., 2017). Regarding high school grade point average, the Observer and the PVCM Test show no significant effects (*Todorova et al., 2017), while a significant effect was observed for the VAIL (*Wiens & Gromlich, 2018).
Significant differences in test scores among groups with different levels of expertise demonstrate the test's sensitivity and indicate construct validity (Cronbach & Meehl, 1955). Thirteen publications report group comparisons; however, most do not focus on validity. Three publications draw comparisons between experts (e.g., teacher educators) and novices, and five publications draw comparisons between in-service and pre-service teachers. Six publications compare different groups of pre-service teachers (e.g., bachelor and master students), while four publications report comparisons among in-service teachers. The group comparisons are summarized in supplementary material G. Overall, the results demonstrate that the instruments have high sensitivity to different levels of expertise with varying effect sizes (d_min = .19, d_max = .84).
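To illustrate how effect sizes for such group comparisons are typically obtained, the sketch below computes Cohen's d with a pooled standard deviation. This is one common variant; the reviewed publications may have used other formulas, and the function name and data here are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Pool the sample variances, weighted by degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_variance == 0:
        raise ValueError("pooled variance is zero")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical noticing scores for in-service and pre-service teachers.
in_service = [2, 4, 6]
pre_service = [1, 3, 5]
print(cohens_d(in_service, pre_service))
```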
Some operations used to address construct validity can also be used to demonstrate criterion-related validity. For two instruments, comparisons between in-service and pre-service teachers are interpreted as evidence of criterion-related validity (PVCM and PVIS Tests; *Gold & Holodynski, 2017;*Meschede et al., 2015). For the VAIL only, criterion-related validity has been investigated by examining the correlation between test scores and observed instructional quality (Jamil et al., 2015).

Discussion
Using a scoping review approach, this study provides an overview of existing standardized instruments for studying teacher noticing, thereby identifying research gaps in this area. Based on a sample of 37 articles published between 2008 and 2019, we identified 22 different test instruments and examined (1) the theoretical conceptualization of noticing, (2) the test design, and (3) the test quality.

Summary of main results
Most instruments differentiate noticing into distinct mental processes while more holistic approaches are rare. The domain-specific focus of noticing varies considerably between instruments, with mathematical aspects predominantly investigated in subject-specific noticing.
In terms of test design, most instruments include video material from classroom practice to elicit noticing, typically using one to six video clips of up to 5 min in duration. The amount of video material, item format, item formulation, and number of items vary considerably, with the latter ranging from three writing prompts to more than 100 rating items. Test scores are commonly determined by comparing participant responses to an expert solution for closed-ended item formats or using a coding scheme for open-response items. Resulting test scores ranged from single values and mean scales up to IRT estimations for a small number of instruments.
Regarding test quality, high reliability scores were reported for around half of the instruments, with no reliability measures reported for only a few instruments. Validity examination was guided by the traditional division into content, construct, and criterion-related validity, with the latter rarely examined. Internal structure was analyzed by considering the different noticing facets as well as content-specific aspects, yielding heterogeneous results with regard to the structure of noticing, including its sub-processes. Few studies have investigated noticing in relation to knowledge and beliefs. Sensitivity to differences between groups with different levels of expertise is more frequently confirmed. However, experts and novices are rarely compared.

Conceptualizations
As the results suggest, the conceptual heterogeneity within the discourse on teacher noticing is equally evident in the field of standardized testing. With regard to the overarching concept, the instruments in the present selection were assigned to either "noticing" or "professional vision." Although the instruments based on these two concepts differ in terms of domain-specific focuses and measurement approaches (supplementary material C, Tables 2, 3 and 4), the different terms used (i.e., noticing and professional vision) should not obscure the fact that the underlying constructs, including their sub-processes ("noticing facets"), are strikingly similar. As outlined in the theory section, it should be noted again that the term "professional vision" refers more strongly to a set of mental processes, while retaining the original idea of professional vision as a specialized way of seeing and understanding events in a professional context. Differences between the constructs measured by instruments relate to the inclusion of a teacher's response, which can be found for noticing instruments. By contrast, the prediction facet is only captured by professional vision instruments. However, the decision-making facet of noticing in *Kaiser et al. (2015) includes anticipation, which is similar to the prediction facet of professional vision in *Seidel and Stürmer's (2014) study.
Another conceptual difference between noticing and professional vision in the context of standardized testing is that the perceiving/attending facet of noticing is further divided into (selective) attention and description in professional vision. However, this difference is not necessarily relevant at the empirical level. For example, the Observer ( *Seidel & Stürmer, 2014) does not explicitly capture noticing as an attentional sub-process of professional vision but limits itself to capturing description. The attending facet of noticing can be measured by asking participants to describe what they observe (e.g., *Jacobs et al., 2010).

Inconsistencies in the measurement approaches
The following test instruments are characterized by an elaborate reliability investigation and a comprehensive validation procedure: the Observer (Blomberg et al., 2011), the PVIS and PVCM Tests, the two Comparative Judgment Instruments (*Keppens et al., 2019; *Roose et al., 2018), the TEDS-FU Video Tests (primary and secondary) (*Kaiser et al., 2015), the VAIL (*Wiens & Gromlich, 2018), and the Monitoring Competence Assessment Tool (*Kaendler et al., 2016).
For the Observer, the TEDS-FU Video Tests, and the Comparative Judgment Instruments, test quality is examined by means of IRT. The Observer is also characterized by a validation procedure including a survey on the appropriateness of the video clips, the examination of the factor structure, and the investigation of repeated measurement effects (*Seidel & Stürmer, 2014). For the TEDS-FU Video Tests, instructional practices were scripted and videotaped to ensure a high density of relevant instructional events on key topics in school mathematics. The PVCM Test (*Gold & Holodynski, 2017) is distinctive in that it examines factor structure using a bifactor model that takes into account that test items are nested within video clips. The PVCM and PVIS Tests are both used to compare differences between in-service and pre-service teachers, including the investigation of measurement invariance (*Gold & Holodynski, 2017; *Meschede et al., 2017). The VAIL is distinguished by its criterion-related validation with observed instructional quality (Jamil et al., 2015). *Wiedmann et al. (2019) implemented analyses based on generalizability theory, offering a promising approach to control for measurement error caused by video clips for the Monitoring Competence Assessment Tool. Finally, the Comparative Judgment Instruments provide an alternative that assesses noticing holistically (*Keppens et al., 2019; *Roose et al., 2018).
In contrast to the measurement procedures described above, no reliability measures (other than interrater reliability) were provided for a large proportion of the instruments. Similarly, for some instruments, validity evidence was only reported selectively, for example focusing exclusively on group comparisons. This finding indicates that the available methods for ensuring psychometric quality are not yet sufficiently used for some instruments.
Measurement approaches are inconsistent with respect to operationalizing noticing facets, particularly regarding the category of perceiving/attending. *Gold and Holodynski (2017) noted that measuring noticing (understood as selective attention) using standardized items is difficult because the item text directs attention. Also, for the Observer, noticing (understood as selective attention) is located in the process of video selection by the research team, whereas the rating items only assess reasoning (*Seidel & Stürmer, 2014). In the TEDS-FU Video Tests, perception is restricted to clear perceptual incidents and measured using both rating items and open-ended questions, while other processes, such as interpretation, are measured using open-response items. Another approach to measuring noticing as an attentional sub-process of professional vision is the use of comparative judgments, even though a judgment about teaching quality certainly involves interpretation.
Finally, test instruments differ in the degree of declarative knowledge required to achieve a high test score. One group of instruments, often based on closed-ended questions, requires an estimation of the focused aspects of teaching quality, which is supposed to be knowledge-based. By contrast, other instruments, often based on open-response items, target the application of explicit, recallable knowledge, including technical terms. This suggests an overlap between noticing measures and contextualized testing of teachers' professional knowledge.

Implications and directions for future research
In addition to providing an overview of available instruments, this review aimed to identify research gaps that can be considered reference points for further research. First, the results suggest that instruments may be developed to assess further aspects of noticing. For example, the testing of subject-specific noticing is currently limited mainly to the subject of mathematics. Given that situated approaches to teacher competence, which have commonalities with the concept of teacher noticing, are gaining weight, researchers from other disciplines might consider whether the standardized testing of subject-specific noticing can enrich the competence assessment in their field. A first approach focuses on the subject-specific noticing of biology teachers (Kramer et al., 2020). Similarly, the decision-making facet was only rarely operationalized using high-quality testing. Since decision-making can be considered a central mediator to teachers' behavior during instruction, further test development should emphasize this facet.
Moreover, central empirical issues concerning construct and criterion-related validation have rarely been addressed, including the differences between experts and novices; the development of noticing during teacher education and the professional career; the relationship between noticing and teacher cognition, including knowledge and beliefs; and the study of criterion measures, such as observed teaching quality or student learning progress. Given that (construct) validity is closely related to the construct's theoretical conceptualization, to address these desiderata, the theoretical foundation of noticing must be strengthened. In this context, an important theoretical reference point is the expertise approach (Berliner, 1988; Sabers et al., 1991), which could stimulate future studies to examine the differences between experts and novices and the development from novice to expert. Blömeke et al. (2015) offer another compatible framework that considers teacher noticing, described as situation-specific skills, as a mediator between cognition and performance. This approach has not been adequately explored empirically and could stimulate the investigation of noticing as a correlate of teacher cognition and teaching performance. However, using these theoretical approaches for test development and validation should be critically evaluated by researchers since unilateral theoretical approaches might bias research results. For example, research based on a noticing measure that was constructed to be a correlate of professional knowledge might lead to overestimating the relevance of specific parts of teachers' professional knowledge such as declarative knowledge acquired from teacher education.
Another promising approach to construct validation, which received scant attention in our literature selection, is the analysis of correlations between different noticing measures. On the one hand, this may include the correlation between two or more standardized measures targeting noticing (e.g., the Observer and the VAIL). On the other hand, it may be of interest to determine whether standardized noticing test scores are associated with the other modalities of measurement outlined above, including (mobile) eye tracking, data retrieved from small wearable cameras, or the investigation of head movements while viewing 360-degree videos. In this context, it should be highlighted that no multitrait-multimethod analysis was reported in the selected literature. Such an analysis would have allowed the effects of different measurement approaches (e.g., item formats) on the measurement of similar or distinct constructs to be taken into account.
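The logic of such a correlational check can be illustrated with a minimal sketch. The instrument names and all scores below are invented for illustration and are not data from any reviewed study; the sketch only shows the expected pattern, namely that two measures of the same construct should correlate more strongly with each other than with a measure of a distinct construct.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical (construct, instrument) pairs with made-up scores for six teachers.
scores = {
    ("noticing", "video_test_a"): [12, 15, 9, 20, 17, 11],
    ("noticing", "video_test_b"): [10, 14, 8, 19, 18, 12],
    ("knowledge", "paper_test"): [30, 28, 35, 31, 27, 33],
}

def correlation_matrix(scores):
    """Pairwise correlations between all measures, keyed by pairs of (construct, instrument)."""
    keys = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in keys for b in keys}

matrix = correlation_matrix(scores)
for (a, b), r in matrix.items():
    if a < b:  # print each distinct pair of measures once
        print(f"{a[1]} vs {b[1]}: r = {r:.2f}")
```

A full multitrait-multimethod analysis would additionally require each construct to be measured with each method, so that trait and method effects can be separated; the sketch covers only the simpler comparison of correlations between measures of the same and of different constructs.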
The finding that the validation strategies mentioned above were rarely addressed by no means implies that each of these strategies should be used for each test instrument. Rather, a stronger orientation toward the testing standards (AERA et al., 2014) might help advance the research field. This includes the explication of intended test interpretations and the collection of empirical evidence supporting these specific interpretations by drawing on several sources of evidence (see *Keppens et al., 2019). Here, the fit between validation measures and intended test use is crucial. Moreover, a sufficient diversity of validation approaches, for example drawing on different theoretical approaches, will help avoid biases caused by a one-sided view of the construct. Against this background, researchers should also report results that are not in accordance with common theoretical assumptions (e.g., unexpected correlations) to support the further development of theory.

Limitations
The results of this scoping review are limited by the criteria used to select relevant literature. Owing to the focus on English-language articles published in peer-reviewed journals, books, and book chapters, publications in other languages were not included. Similarly, we omitted publications that aimed at comparable situation-specific competencies without using the terms "noticing" or "professional vision." This concerns instruments such as video-based tests of classroom management expertise (König & Kramer, 2016) and teachers' knowledge of teaching mathematics (Kersting, 2008). Furthermore, recent developments such as the "Video Assessment of Teacher Knowledge" (Wiens, Beck, & Lunsmann, 2020) are not considered; however, this instrument focuses on knowledge rather than noticing. Nonetheless, this review may help raise awareness of the differentiation between video-based measures of teacher noticing and video-based measures of teacher knowledge that do not rely on teacher noticing.
Finally, the comparatively broad range of kappa values suggests that, for some concepts, it was difficult to reach a shared understanding among coders. We addressed this issue by presenting tables that included all information relevant for coding, thus ensuring the transparency of our analysis.

Conclusion
We consider the results of this scoping review to be encouraging with respect to the development of quality test instruments to measure teacher noticing. Several high-quality test instruments were identified as providing measurement approaches and validation strategies that other researchers can draw on. Given the heterogeneity of existing instruments outlined above, the overview provided by this review might support researchers in carefully conceptualizing the noticing construct they aim to measure (including the definitions and nomenclature of the noticing facets addressed) and in judiciously selecting adequate measurement approaches and appropriate validation strategies.

Funding
This work was supported by the German Ministry of Education and Research [grant numbers: 01PK19006A, 01PK19006B].

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Not applicable.