In recent decades, there has been a shift from knowledge-based to competency-based assessments of psychotherapy skills, which require not only the completion of a certain amount of coursework but also the application of skills in practice (Fouad et al., 2009; Kaslow et al., 2007; Kring et al., 2022). Therapist competence may thus be conceptualized as “the degree to which a therapist demonstrates the general therapeutic and treatment-specific knowledge and skills required to appropriately deliver cognitive-behavioral therapy (CBT) interventions which reflect the current evidence base for treatment of the patient’s presenting problem” (Muse & McManus, 2013, p. 485). In a similar vein, Barber and colleagues (2007) define competence as “the judicious application of communication, knowledge, technical skills, clinical reasoning, emotions, values, and contextual understanding for the benefit of the individual and community being served” (p. 494).

These conceptualizations underline the practical application of therapeutic knowledge and skills, and there are several reasons for assessing psychotherapeutic competence. First, the assessment of therapist competence is essential for facilitating trainees’ skill acquisition and fostering therapists’ continued professional development (Muse et al., 2022; Weck et al., 2021). Accordingly, the measurement of therapist competence is especially useful for trainees to obtain feedback on their performance and to develop their skills as therapists (Muse & McManus, 2013). Second, competence assessments play a critical role in the empirical evaluation of CBT because “research trials cannot draw valid conclusions regarding the efficacy of CBT protocols unless the competence with which the protocols are delivered can be established” (Muse & McManus, 2013, p. 485). Third, therapist competence has been discussed as a moderator of treatment outcomes, indicating that competence assessments could foster optimal CBT effectiveness for patients (Kuyken & Tsivrikos, 2009; Power et al., 2022; Strunk et al., 2010; Zarafonitis-Müller et al., 2014). In summary, the assessment of psychotherapeutic competence seems beneficial for research trials, professional training, and clinical care.

Assessment of Therapist Competence

For competence assessments, the Cognitive Therapy Scale (CTS; Young & Beck, 1980) and its revised version (CTS-R; Blackburn et al., 2001) have been widely used in empirical studies and represent a ‘benchmark’ for clinical skill assessment (Kühne et al., 2020; Muse et al., 2022). It has been shown that with appropriate training and monitoring, raters’ assessments of therapist competence using the CTS or its revised version are reliable (Kazantzis et al., 2018). The CTS has been applied not only to rate encounters with clinical patients but also as a tool to assess the competence of trainees in standardized role play settings (Kühne et al., 2022). In the context of training future therapists and assessing their skills, standardized role plays have been proposed as a promising and valid tool to measure therapist competence while reducing variability in patient presentation (Fairburn & Cooper, 2011; Kaslow et al., 2007; Muse & McManus, 2013). Standardized role play settings involve standardized patients (SPs), who are trained, provided with in-depth role scripts, and enact their roles following clinical scenarios (Kühne et al., 2022). If trained comprehensively, SPs have been shown to act authentically, with trainees being unable to differentiate real patients from SPs (Ay-Bryson et al., 2022). Standardized role plays also involve therapists, i.e., trainees who interact with the SP while conducting elements of a therapy session or a specific therapeutic task (Fairburn & Cooper, 2011). Evaluations based on such standardized role plays might be particularly suitable for comparing competence assessments by different raters.

Perspectives on the Assessment of Therapist Competence

Different perspectives for skill assessment have been proposed, including assessments performed by patients, therapists, supervisors and independent judges (Weck, 2013). Several studies have compared therapists’, supervisors’ and/or independent judges’ ratings of competence. Most studies have reported that therapists overestimate their competence in self-ratings compared to independent judges (Caron et al., 2020; Dennhag et al., 2012a; Rozek et al., 2018), while one study reported underestimation (McManus et al., 2012). In a similar vein, supervisors’ agreement with independent judges’ ratings has been reported to be fairly low (Caron et al., 2020; Dennhag et al., 2012a). Hence, supervisors as well as therapists assess competence subjectively. Independent raters evaluating audio or video recordings of therapy sessions are thus considered more objective and a ‘gold standard’ for therapist assessment (Muse et al., 2022; Weck, 2013). However, all perspectives may provide valuable insights into the therapeutic process (Muse et al., 2022).

Whether therapists’ self-ratings or supervisors’ or independent judges’ ratings are used, it is necessary to ensure the reliability of their assessments (Roth & Pilling, 2007; Waltz et al., 1993). Reliability has been shown to vary greatly, most likely depending on raters’ degree of clinical expertise and the amount of training needed to achieve reliable assessments (Kühne et al., 2020; Muse & McManus, 2013; Weck, 2013). Rater training and the rating procedure are criticized for being cost- and resource-intensive, yet it is assumed that raters without expertise and training are unlikely to reliably assess therapist competence (Muse & McManus, 2013). The discussion about the expertise needed to evaluate therapeutic competence is ongoing. Some experts argue that competence assessment “is a complex skill that needs to be fostered through experiential training and requires experience and expertise” (Muse & McManus, 2016, p. 254). However, experimental studies comparing novice raters, i.e., trained psychology students, and expert raters, i.e., trained therapists, showed that novices were able to evaluate competence with satisfactory to high reliability (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c). Novices’ assessments might be especially adequate when the material refers to basic therapeutic tasks such as psychoeducation (Weck, Weigel, Weck et al., 2011a, b, c) or when evaluating highly standardized settings. Considering cost-efficiency and availability, novice raters may thus be an alternative to expert raters, thereby reducing barriers to the implementation of competence assessments for teaching and research purposes.

Factors Influencing the Reliability of Competence Assessments

In addition to clinical expertise and training, other factors, such as the number of raters (Weck, 2013), the number of sessions rated per patient (Dennhag et al., 2012b) or whether session segments vs. entire sessions are rated (Weck et al., 2011), have been considered to influence reliability. While a recent meta-analysis revealed that none of these moderators were significant, larger samples are needed to determine their actual importance (Kühne et al., 2020). Furthermore, other under-researched factors, such as the camera perspective as a practical component of recordings, might also influence reliability. Typically, patients and therapists are recorded from the side, which limits how closely their facial expressions can be observed and thus restricts the information available to raters. However, research on oncology patients has shown that the camera focus may influence spectators’ engagement with video vignettes and their perceived realism (Visser et al., 2018).

The Current Study

While research on competency-based assessments of psychotherapy skills has gained momentum, some limitations and unanswered questions remain. As criticized by a meta-analysis on interrater reliability (IRR; Kühne et al., 2020), previous studies have used small samples of fewer than 30 tapes, limiting their statistical power. As studies on therapist competence ratings are often a byproduct of other main trials (Kühne et al., 2020), standards for reporting these ratings are not always met, resulting in a lack of information on rater training, feedback or IRR. Moreover, few studies have compared independent raters with different levels of expertise (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c), and no study has yet examined the possible influence of the camera perspective.

The objective of this study was to compare the IRR of expert and novice raters, i.e., raters differing in their level of expertise. The ratings were based on video recordings of standardized role play interactions including SPs and predefined CBT tasks in a laboratory setting. Specifically, this study had four aims:

  1. to compare experts’ and novices’ mean ratings of trainee therapist competence,

  2. to determine the IRR of expert and novice raters,

     • Hypothesis 1: We expected good IRR of expert and novice raters, with ICCs ≥ 0.75.

     • Hypothesis 2: We expected expert raters to show significantly higher reliability in the assessment of more complex skills, i.e., psychotherapeutic competence and therapeutic alliance, but not in the assessment of basic counseling skills and empathy.

     • Hypothesis 3: We expected good concordance (ICCs ≥ 0.75) between expert and novice raters.

  3. to examine whether different camera perspectives (i.e., standard perspective vs. three perspectives) influenced IRR, and

  4. to explore IRR development with an increasing number of rated videos.

Methods

Study Overview

Data for the current study were drawn from a randomized controlled trial on modeling CBT skills as a key strategy for increasing therapeutic competence (Kühne et al., 2022). Sixty-nine trainees were randomly assigned to an intervention group (IG, manual reading plus modeling) or a control group (CG, manual reading). Trainees participated in two role plays before training (T0), two role plays after training (T1) and two role plays three months later (T2), in each case treating trained SPs. In both groups, training included reading manual instructions on behavioral activation and exploring cognitive bias based on an evidence-based CBT depression manual (Hautzinger, 2013). Participants had 20 min to read the manual instructions. In the intervention group, trainees additionally watched a video of an experienced licensed cognitive-behavioral psychotherapist who skillfully demonstrated behavioral activation and exploring cognitive bias with an SP. Each video lasted 20 min. Participants were instructed to implement what they had read or watched during the role plays and were given 20 min to complete each role play. All role plays were video recorded.

The trainees were psychology students from the University of Potsdam with a mean age of 25.58 years (SD = 6.43); they were mainly female (81.2%) and Caucasian (81.2%). The majority were undergraduate students (84.1%) and did not have previous experience in treating mental health patients (69.6%). Participants had not received specific CBT training before. The SPs were seven undergraduate students from disciplines other than psychology. Their mean age was 22.29 years (SD = 2.14), six were female, and all were Caucasian. Prior to the role play sessions, the SPs were trained by a licensed psychotherapist and participated in a 12-hour workshop on portraying symptoms of depression. The SPs were blinded to the experimental conditions.

In contrast to the main trial, the current study focused on the reliability of the competence ratings of three rater dyads and used all available videos regardless of when they were recorded (pre/post intervention, follow-up). In total, each rater evaluated 359 videos.

Expert and Novice Raters

Three rater dyads participated in this study: the Expert, Novice A, and Novice B dyads. All raters were female and Caucasian. The characteristics of and differences between the rater dyads are shown in Table 1. All raters participated in the same study workshop and received training on the use of the instruments by two licensed therapists (FK, FW). During the workshop, the raters practiced using the rating instruments with an example video, discussed their ratings and had the opportunity to ask questions about the rating procedure and the study in general. All raters made their ratings independently. Furthermore, all raters were blinded to the study condition, the timing of the video (pre/post intervention, follow-up) and the trainee’s previous experience. It took approximately 20 min to watch and assess one video. The ratings were conducted over a period of 12 months. In sum, each rater assessed 359 videos. To ensure reliability and to counteract rater drift, all raters participated in six additional 1-h rater training sessions throughout the study. In all rater training sessions, ratings were only discussed and not changed afterward.

Table 1 Rater characteristics

Camera Perspective

The Expert and Novice B dyads watched all videos from a lateral perspective, which focused on the bodies and faces of the SPs and the study therapists from the side (i.e., standard perspective). The Novice A dyad only watched half of the videos from this standard perspective. The other half of the videos were rated by watching a screen showing three perspectives at the same time, including (a) the standard perspective described above, (b) a frontal perspective that solely focused on the SP, and (c) a frontal perspective that solely focused on the trainee therapist (see Fig. 1). The latter option provided the raters with a potentially broader range of information and details (e.g., facial reactions, empathic understanding).

Fig. 1

Camera Perspectives. Note. (a) standard perspective, (b) frontal perspective that solely focused on the SP, and (c) frontal perspective that solely focused on the trainee therapist

Measures

Counseling Skills

For the assessment of basic communication skills, the German version of the Clinical Communication Skills Scale – Short Form (CCSS-S; Maaß et al., 2022) was used. The CCSS-S is a 14-item measure using a 4-point scale ranging from 0 (not at all appropriate) to 3 (entirely adequate). An example item is “The therapist uses open questions to motivate the patient to talk”. Maaß et al. (2022) report moderate to good interrater reliabilities (ICC(2,2) = 0.65 − 0.80) and high correlations with other competence rating scales (rs = 0.86 − 0.89).

Psychotherapeutic Competence

For the assessment of psychotherapeutic competence, the German version of the Cognitive Therapy Scale (CTS; Weck et al., 2010), which combines items of the CTS (Young & Beck, 1980) and the CTS-R (Blackburn et al., 2001), was used. The German CTS is a 14-item measure using a 7-point rating scale ranging from 0 (poor) to 6 (excellent). For the current study, only 11 items were used, as three items could not be rated due to the specific task trainees had received in this study. The items used were: (2) dealing with problems/questions/objections, (3) clarity of communication, (4) pacing and efficient use of time, (5) interpersonal effectiveness, (6) resource activation, (8) using feedback and summaries, (9) guided discovery, (10) focusing on central cognitions and behavior, (11) rationale, (13) appropriate implementation of techniques, and (14) assigning homework. The internal consistency of the German CTS was good in one study (α = 0.86; Weck et al., 2010), and a previous meta-analysis reported fair to excellent interrater reliability across several studies, ICCs = 0.42 − 0.97 (Kühne et al., 2020).

Therapeutic Alliance

For the evaluation of therapeutic alliance, the German version of the Helping Alliance Questionnaire (HAQ; Bassler et al., 1995; Luborsky, 1984) was used. The HAQ is an 11-item measure using a 6-point scale ranging from 1 (strongly disagree) to 6 (strongly agree). The wording was changed to reflect that the raters had to evaluate the relationship between the therapist and patient. An example item is “I believe the patient is working together with the therapist in a joint effort”. The interrater reliability of the mean HAQ score was satisfactory in a previous study, ICC(2,2) = 0.73; p < .001 (Richtberg et al., 2016).

Empathy

For the evaluation of therapist empathy, the German version of the Empathy Scale (ES; Partschefeld et al., 2013; Persons & Burns, 1985) was used. The ES is a 10-item measure using a 4-point scale ranging from 1 (strongly disagree) to 4 (strongly agree). The wording was changed to reflect that raters had to evaluate the therapist’s behavior. An example item is “The things the therapist says and does makes the patient feel they can trust the therapist”. Partschefeld et al. (2013) report good internal consistency (α = 0.84 − 0.89) and good interrater reliability (ICCs = 0.82 − 0.85). Due to their workload, the expert raters evaluated empathy for only 200 videos.

Data Analysis

Descriptive statistics were obtained for all outcome measures. As we first aimed to investigate mean differences in competence assessments on all measures (CCSS-S, CTS, ES, and HAQ), we compared the means of all rater dyads using analyses of variance and post hoc tests (Tukey’s HSD). Regarding our second aim, to determine the IRR of novice and expert raters, we computed ICCs for the rater dyads. Following Shrout and Fleiss (1979), a two-way random effects model (absolute agreement) using the mean of two raters (ICC(2,2)) was computed. As proposed by Koo and Li (2016), ICCs between 0.50 and 0.75 were considered moderate and those of 0.75 and higher were considered good. Following these guidelines, we expected good IRR (Hypothesis 1). To compare ICCs between experts and novices, the Fisher Z-transformation was used (McGraw & Wong, 1996; Hypothesis 2). Concordance between expert and novice raters was calculated using ICCs based on mean expert and mean novice ratings (Hypothesis 3). Regarding our third aim, to examine the influence of camera perspective on IRR, the Fisher Z-transformation was used (McGraw & Wong, 1996). Fourth, we descriptively analyzed the IRR development based on the number of rated videos.
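To make the reliability analysis concrete, a minimal sketch of the ICC(2,2) computation is given below. It is illustrative only: the analyses in this study were run in SPSS, and the function name, data layout and example scores are hypothetical. The sketch implements Shrout and Fleiss’s (1979) two-way random effects, absolute agreement, average-measures coefficient for k raters, which corresponds to the ICC(2,2) reported per rater dyad when k = 2.

```python
# Illustrative sketch (not the authors' SPSS syntax): ICC(2,k) after
# Shrout & Fleiss (1979) -- two-way random effects, absolute agreement,
# average of k raters. Data and variable names are hypothetical.
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """ratings: n_videos x k_raters matrix of scores, no missing values."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    video_means = ratings.mean(axis=1)   # per rated video
    rater_means = ratings.mean(axis=0)   # per rater

    # Mean squares from the two-way ANOVA decomposition
    bms = k * np.sum((video_means - grand_mean) ** 2) / (n - 1)   # between videos
    jms = n * np.sum((rater_means - grand_mean) ** 2) / (k - 1)   # between raters
    ems = np.sum((ratings - video_means[:, None]
                  - rater_means[None, :] + grand_mean) ** 2) / ((n - 1) * (k - 1))

    # Average-measures, absolute-agreement ICC (Shrout & Fleiss, 1979)
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical CTS total scores of one rater dyad for five videos
scores = np.array([[3.1, 3.4],
                   [2.0, 2.2],
                   [4.5, 4.1],
                   [1.8, 2.5],
                   [3.6, 3.3]])
print(round(icc_2k(scores), 2))
```

In practice, an established routine (e.g., the RELIABILITY procedure in SPSS or pingouin’s intraclass_corr in Python) would additionally provide the confidence intervals reported in Table 3.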

All analyses were performed using SPSS Statistics 29 (IBM Corp, 2022). Missing data occurred only for the CCSS-S (0 to 1.7%) and were handled using listwise deletion. Power calculations indicated that, given a minimal accepted reliability of ICC = 0.60, an expected reliability of ICC = 0.75, an α of 0.05, a power (1-β) of 0.80 and k = 2 raters, a total of 114 videos per rater dyad would be necessary (Arifin, 2023).

Results

Mean Competence Ratings

First, we investigated the differences between raters’ mean competence assessments using ANOVA. The results and post hoc tests for all rater dyads are depicted in Table 2. The post hoc tests showed that the Expert raters gave significantly lower competence scores than the Novice B dyad on the CTS, CCSS-S and HAQ. The Expert raters also gave significantly lower competence scores than the Novice A dyad on the HAQ but not on the CTS or the CCSS-S. For the ES, the Expert dyad’s ratings were significantly higher than those of the Novice A and Novice B dyads. The Novice A dyad gave significantly lower competence scores than the Novice B dyad on all measures.

Interrater Reliability

Table 3 presents the IRR and confidence intervals of the rater dyads for all measures. Our first hypothesis, according to which both expert and novice raters would be able to evaluate competence with good IRR, was mostly not supported. The Expert dyad’s mean ICCs were good for the CCSS-S and moderate for the CTS, HAQ and ES. For the Novice A dyad, the mean ICCs were good for the CTS and HAQ and moderate for the CCSS-S and ES. For the Novice B dyad, mean ICCs were moderate to poor across all measures.

Table 2 Comparison of mean competence ratings
Table 3 Interrater reliabilities for expert and novice competence ratings

Our second hypothesis, according to which experts would assess more complex therapeutic competence (i.e., CTS, HAQ) but not empathy (ES) or basic counseling skills (CCSS-S) with significantly higher reliability than novices, was not supported. Contrary to our hypothesis regarding the CTS, there was no difference in IRR between the Expert and the Novice B dyad (z = 1.53, p = .063), and the ratings were significantly less reliable in the Expert dyad than in the Novice A dyad (z = -1.84, p = .033). Similarly, for the HAQ, the ratings were significantly less reliable in the Expert dyad than in the Novice A dyad (z = -3.52, p < .001). For the CCSS-S, the assessments were more reliable in the Expert dyad than in the Novice A dyad (z = 3.00, p < .001) but not more reliable than in the Novice B dyad (z = 1.50, p = .067). Contrary to our hypothesis, the difference for the ES was also significant, with higher reliability in the Expert dyad than in the Novice B dyad (z = 4.50, p < .001).

According to our third hypothesis, we expected good concordance between mean expert and novice ratings. This was supported only for the Expert and Novice A dyads. In contrast, the concordance between the Expert and Novice B dyads was only moderate (see Table 3).

Camera Perspective

To address the third aim of this study regarding the influence of the camera perspective on IRR, we compared the ICCs of all measures based on camera perspective (i.e., standard perspective vs. three perspectives) for the Novice A dyad. No significant differences were detected (all ps > 0.05; see Supplementary material 1).

Explorative Analysis

In line with our fourth aim, we explored IRR development descriptively based on the number of rated videos. The exemplary IRR development for the CTS is shown in Fig. 2. The results for the other measures are reported in Supplement 2.

Fig. 2

Development of IRR for the CTS Based on the Number of Rated Videos

The ICCs of all rater dyads varied considerably over the course of the ratings. For the Expert dyad, agreement improved after approximately 140 video ratings but decreased afterward. Agreement improved for the Novice A dyad after 140 videos and continued to improve, reaching a stable level after 200 videos. The agreement of the Novice B dyad gradually decreased with an increasing number of rated videos.
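A plausible way to produce such a trajectory (not necessarily the exact procedure used in this study) is to re-estimate a dyad’s ICC(2,2) on the first m rated videos for increasing m. The sketch below reuses the hypothetical icc_2k() helper from the Data Analysis section; the step size and the simulated ratings are arbitrary assumptions made purely for illustration.

```python
# Sketch of a cumulative IRR trajectory: re-estimate ICC(2,2) on the first m
# rated videos for growing m. Assumes icc_2k() from the Data Analysis sketch
# is in scope; 'dyad_ratings' is an assumed n_videos x 2 score matrix for one
# dyad, ordered by the sequence in which the videos were rated.
import numpy as np

def icc_trajectory(dyad_ratings, start=20, step=20):
    """Return (m, ICC(2,2) over the first m videos) for m = start, start + step, ..."""
    return [(m, icc_2k(dyad_ratings[:m]))
            for m in range(start, dyad_ratings.shape[0] + 1, step)]

# Purely simulated example (not study data): 359 videos rated by two raters
rng = np.random.default_rng(0)
true_scores = rng.normal(3.0, 1.0, size=(359, 1))
dyad_ratings = true_scores + rng.normal(0.0, 0.5, size=(359, 2))
for m, icc in icc_trajectory(dyad_ratings):
    print(m, round(icc, 2))
```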

Discussion

This study investigated the IRR of expert and novice raters for the basic counseling skills (CCSS-S), psychotherapeutic competence (CTS), empathy (ES) and therapeutic alliance (HAQ) of trainee therapists in a standardized setting. Our results must be discussed considering that measurement quality is largely influenced by rater-, instrument- and sample-related factors (Kottner et al., 2011).

First, we aimed to determine the IRR of expert and novice raters, expecting good IRR. In contrast to our hypothesis, the IRR was moderate rather than good, with the Expert and Novice A dyads achieving the highest IRR and the Novice B dyad achieving moderate to poor IRR. As the Novice B dyad did not receive supervised rater trainings but discussed ratings on their own, their insufficient IRR might be a direct result of the lack of training and supervision. In addition, it must be assumed that even highly experienced and trained raters are, to some degree, subject to several forms of errors and biases (Eckes & Jin, 2021, 2022). Rater effects such as severity/leniency bias or halo effects may thus have had an impact on our results, as may rater characteristics such as agreeableness (Ceh et al., 2022). In this way, some of our results may be due to systematic rater effects rather than differences in training or clinical experience, especially given the low number of raters involved. Furthermore, we observed that the ICCs of two rater dyads dropped over the course of the ratings. This is in line with findings showing that the reliability of ratings tends to plateau when evaluating one trainee for a certain amount of time (Kring et al., 2022). This might indicate that the quality of the ratings does not necessarily improve with higher quantity. Although the raters in the current study were advised to pause during the ratings, they may have experienced fatigue or loss of motivation due to repeatedly watching and evaluating videos with the same content (Suess & Schmiedeck, 2000). Future studies should therefore pay attention to rater effects (such as response bias and halo effects) and involve more raters to minimize these effects (Dennhag et al., 2012a, b). Additionally, future research could include other perspectives on competence assessment, such as those of supervisors or therapists themselves (Muse et al., 2022).

Second, we expected the IRR of experts on complex psychotherapeutic competences to be higher than that of novices. Interestingly, the Novice A dyad achieved significantly higher reliability than the Expert dyad for psychotherapeutic competence and therapeutic alliance. This is especially intriguing given that the experts had both greater rating expertise and more experience in conducting CBT. Our results thus indicate that rating expertise and therapeutic experience do not necessarily ensure high IRR. As our results also showed great heterogeneity in ICCs, with no rater dyad achieving consistently high reliability on all measures, aspects related to the instruments need to be discussed as well. The highest IRR across rater dyads occurred for the CTS and CCSS-S, which is in line with previous studies using comparably trained and experienced raters in standardized settings (Alpers & Hengen, 2021). Of note, our results are based only on selected items of the CTS, which may have influenced validity. Inconsistent and low ICCs appeared particularly for the ES and HAQ. For empathy assessments, the Expert dyad gave significantly higher mean ratings than the novice dyads, but the IRR of all raters was only moderate. Relatedly, the reliability of the Expert and Novice B dyads’ assessments of therapeutic alliance was only moderate. It is possible that novice raters are less familiar with overarching therapy concepts (such as therapeutic alliance), making their assessment difficult. On the other hand, low ICCs might also be due to specific characteristics of the scales, such as the ambiguous phrasing of items. This is especially true for scales or items asking raters to evaluate higher level constructs and a therapist’s general understanding of the patient (Schmidt et al., 2018). Future studies should therefore discuss raters’ understanding of the assessed concepts throughout their training and use scales with verbal anchors defining the meaning of each or selected scale points (Weck et al., 2010). Our results moreover point to the need for rating scales with more specific and observable criteria.

Third, we aimed to determine the concordance between expert and novice raters. In line with our hypothesis, concordance was good between the Expert and Novice A dyads. In contrast, concordance between the Expert and Novice B dyads was only moderate, with wide confidence intervals (see Table 3). Raters in our study were trained thoroughly and received regular training from the same two supervisors, ensuring that all raters based their ratings on the same information and guidelines. Only the Novice B dyad did not receive supervised training and discussed ratings on their own. This might explain the large difference between their mean evaluations and those of the other rater dyads; the Novice B dyad gave significantly higher competence ratings and had low concordance with the Expert dyad. While previous studies have for the most part reported including regular meetings for raters to discuss their ratings (Dittmann et al., 2017; Karterud et al., 2013; Weck et al., 2011a), we suggest that additional supervision in such meetings could further improve the rating process.

Fourth, we aimed to explore the influence of the camera perspective on IRR. We did not find significant differences based on camera perspective. These results are in line with a meta-analysis pointing out that across IRR studies on competence assessments, no moderators could be identified (Kühne et al., 2020). In fact, empirical evidence for moderators of competence ratings remains scarce (Muse et al., 2022). Although we found no effect of the camera perspective from which the rated videos were recorded, future research should continue to investigate possible moderators of IRR. Studies using large samples are needed to detect moderators such as rater training or biases.

This study also adds to the knowledge about competence assessments of trainees at an early skill acquisition stage. We thereby tapped into a particularly sensitive stage of competence development, in which reliable competence assessments are essential for further skill consolidation (Muse et al., 2022). However, the competence ratings were limited to the assessment of student trainees based on interactions in a standardized setting using well-structured and predefined tasks as well as SPs. This limits the data’s generalizability and reduces external validity. These circumstances may offer an explanation for the good IRR of the Novice A dyad as well as their high concordance with the Expert dyad. Previous studies using naturalistic samples and more experienced therapists reported lower ICCs for novices and lower concordance between experts and novices (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c). While the applicability to more naturalistic settings may be criticized, such a standardized approach has proven useful, indicating that at least for adherence ratings fewer standardized assessments are needed to achieve a reliable estimate (Imel et al., 2014). Subsequent studies should therefore include competence ratings derived from naturalistic settings and more experienced therapists to complement our findings (Kazantzis et al., 2018). In practice, future research could examine how trainees respond to feedback provided by novices compared to experts.

Conclusion

In this study, we explored ways to improve competence measurements. We contributed to competence and training research by using trained independent raters, controlling for rater drift and comparing raters with different levels of expertise and training. As a strength of the current study, all raters assessed over 350 videos. To date, only a few studies have used a similar number of videos, and these indicate great variability in IRR (Barber et al., 2004; Dennhag et al., 2012a, b). Moreover, we implemented several measures for competence assessments, as any single scale is unlikely to provide a comprehensive measure of therapist competence (Muse et al., 2022). Our findings are, however, limited to the role play setting of this study using well-structured and predefined tasks as well as SPs. We suggest that trained novices such as psychology students or therapist trainees may assess psychotherapeutic competencies in standardized situations with adequate reliability, especially when supervised and using well-established, unambiguous scales. At the same time, our findings indicate that the IRR of novice raters might be highly dependent on their training and supervision throughout their ratings. While including novice raters might reduce barriers to the implementation of competence assessments, their inclusion could be limited to standardized settings or basic therapeutic tasks. In summary, this study underscores the importance and complexity of training future therapists.