In recent decades, there has been a shift from knowledge-based to competency-based assessments of psychotherapy skills, which require not only the completion of a certain amount of coursework but also the application of skills in practice (Fouad et al., 2009; Kaslow et al., 2007; Kring et al., 2022). Therapist competence may thus be conceptualized as “the degree to which a therapist demonstrates the general therapeutic and treatment-specific knowledge and skills required to appropriately deliver cognitive-behavioral therapy (CBT) interventions which reflect the current evidence base for treatment of the patient’s presenting problem” (Muse & McManus, 2013, p. 485). In a similar vein, Barber and colleagues (2007) define competence as “the judicious application of communication, knowledge, technical skills, clinical reasoning, emotions, values, and contextual understanding for the benefit of the individual and community being served” (p. 494).

These conceptualizations underline the practical application of therapeutic knowledge and skills, and there are several reasons for assessing psychotherapeutic competence. First, the assessment of therapist competence is essential for facilitating trainees’ skill acquisition and fostering therapists’ continued professional development (Muse et al., 2022; Weck et al., 2021). Accordingly, the measurement of therapist competence is especially useful for trainees to obtain feedback on their performance and to develop their skills as therapists (Muse & McManus, 2013). Second, competence assessments play a critical role in the empirical evaluation of CBT because “research trials cannot draw valid conclusions regarding the efficacy of CBT protocols unless the competence with which the protocols are delivered can be established” (Muse & McManus, 2013, p. 485). Third, therapist competence has been discussed as a moderator of treatment outcomes, indicating that competence assessments could foster optimal CBT effectiveness for patients (Kuyken & Tsivrikos, 2009; Power et al., 2022; Strunk et al., 2010; Zarafonitis-Müller et al., 2014). In summary, the assessment of psychotherapeutic competence seems beneficial for research trials, professional training, and clinical care.

Assessment of Therapist Competence

For competence assessments, the Cognitive Therapy Scale (CTS; Young & Beck, 1980) and its revised version (CTS-R; Blackburn et al., 2001) have been widely used in empirical studies and represent a ‘benchmark’ for clinical skill assessment (Kühne et al., 2020; Muse et al., 2022). It has been shown that with appropriate training and monitoring, raters’ assessments of therapist competence using the CTS or its revised version are reliable (Kazantzis et al., 2018). The CTS has been applied not only to rate encounters with clinical patients but also as a tool to assess the competence of trainees in standardized role play settings (Kühne et al., 2022). In the context of training future therapists and assessing their skills, standardized role plays have been proposed as a promising and valid tool to measure therapist competence while reducing variability in patient presentation (Fairburn & Cooper, 2011; Kaslow et al., 2007; Muse & McManus, 2013). Standardized role play settings involve standardized patients (SPs), who are trained, provided with in-depth role scripts, and enact their roles following clinical scenarios (Kühne et al., 2022). If trained comprehensively, SPs have been shown to act authentically, with trainees being unable to differentiate real patients from SPs (Ay-Bryson et al., 2022). Standardized role plays also involve therapists, i.e., trainees who interact with the SP while conducting elements of a therapy session or a specific therapeutic task (Fairburn & Cooper, 2011). Evaluations based on such standardized role plays might be particularly suitable for comparing competence assessments by different raters.

Perspectives on the Assessment of Therapist Competence

Different perspectives for skill assessment have been proposed, including assessments performed by patients, therapists, supervisors and independent judges (Weck, 2013). Several studies have compared therapists’, supervisors’ and/or independent judges’ ratings of competence. Most studies have reported that therapists overestimate their competence in self-ratings compared to independent judges (Caron et al., 2020; Dennhag et al., 2012a; Rozek et al., 2018), while one study reported underestimation (McManus et al., 2012). In a similar vein, supervisors’ agreement with independent judges’ ratings has been reported to be fairly low (Caron et al., 2020; Dennhag et al., 2012a). Hence, supervisors as well as therapists assess competence subjectively. Independent raters evaluating audio or video recordings of therapy sessions are thus considered more objective and a ‘gold standard’ for therapist assessment (Muse et al., 2022; Weck, 2013). However, all perspectives may provide valuable insights into the therapeutic process (Muse et al., 2022).

Whether therapists’ self-ratings or supervisors’ or independent judges’ ratings are used, it is necessary to ensure the reliability of their assessments (Roth & Pilling, 2007; Waltz et al., 1993). Reliability has been shown to vary greatly, most likely depending on raters’ degree of clinical expertise and the amount of training needed to achieve reliable assessments (Kühne et al., 2020; Muse & McManus, 2013; Weck, 2013). Rater training and the rating procedure are criticized for being cost- and resource-intensive, yet it is assumed that raters without expertise and training are unlikely to reliably assess therapist competence (Muse & McManus, 2013). The discussion about the expertise needed to evaluate therapeutic competence is ongoing. Some experts argue that competence assessment “is a complex skill that needs to be fostered through experiential training and requires experience and expertise” (Muse & McManus, 2016, p. 254). However, experimental studies comparing novice raters, i.e., trained psychology students, and expert raters, i.e., trained therapists, showed that novices were able to evaluate competence with satisfactory to high reliability (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c). Novices’ assessments might be especially adequate when the material refers to basic therapeutic tasks such as psychoeducation (Weck, Weigel, Weck et al., 2011a, b, c) or when evaluating highly standardized settings. Considering cost-efficiency and availability, novice raters may thus be an alternative to expert raters, thereby reducing barriers to the implementation of competence assessments for teaching and research purposes.

Factors Influencing the Reliability of Competence Assessments

In addition to clinical expertise and training, other factors, such as the number of raters (Weck, 2013), the number of sessions rated per patient (Dennhag et al., 2012b) or whether session segments vs. entire sessions are rated (Weck et al., 2011), have been considered to influence reliability. While a recent meta-analysis revealed that none of these moderators were significant, larger samples are needed to determine their actual importance (Kühne et al., 2020). Furthermore, other under-researched factors, such as the camera perspective as a practical component of recordings, might also influence reliability. Typically, patients and therapists are recorded from the side, which limits how closely their facial expressions can be observed and thus restricts the information available to raters. However, research on oncology patients has shown that the camera focus may influence spectators’ engagement with video vignettes and their perceived realism (Visser et al., 2018).

The Current Study

While research on competency-based assessments of psychotherapy skills has gained momentum, some limitations and unanswered questions remain. As criticized by a meta-analysis on interrater reliability (IRR; Kühne et al., 2020), previous studies have used small samples of fewer than 30 tapes, limiting their statistical power. As studies on therapist competence ratings are often a byproduct of other main trials (Kühne et al., 2020), standards for reporting these ratings are not always met, resulting in a lack of information on rater training, feedback or IRR. Moreover, few studies have compared independent raters with different levels of expertise (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c), and no study has yet examined the possible influence of the camera perspective.

The objective of this study was to compare the IRR of expert and novice raters, i.e., raters differing in their level of expertise. The ratings were based on video recordings of standardized role play interactions including SPs and predefined CBT tasks in a laboratory setting. Specifically, this study had four aims:

  1. to compare experts’ and novices’ mean ratings of trainee therapist competence,

  2. to determine the IRR of expert and novice raters,

     • Hypothesis 1: We expected good IRR of expert and novice raters, with ICCs ≥ 0.75.

     • Hypothesis 2: We expected expert raters to show significantly higher reliability in the assessment of more complex skills, i.e., psychotherapeutic competence and therapeutic alliance, but not in the assessment of basic counseling skills and empathy.

     • Hypothesis 3: We expected good concordance (ICCs ≥ 0.75) between expert and novice raters.

  3. to examine whether different camera perspectives (i.e., standard perspective vs. three perspectives) influenced IRR, and

  4. to explore IRR development with an increasing number of rated videos.

Methods

Study Overview

Data for the current study were drawn from a randomized controlled trial on modeling CBT skills as a key strategy for increasing therapeutic competence (Kühne et al., 2022). Sixty-nine trainees were randomly assigned to an intervention group (IG, manual reading plus modeling) or a control group (CG, manual reading). Trainees participated in two role plays before training (T0), two role plays after training (T1) and two role plays three months later (T2), in each case treating trained SPs. In both groups, training included reading manual instructions on behavioral activation and exploring cognitive bias based on an evidence-based CBT depression manual (Hautzinger, 2013). Participants had 20 min to read the manual instructions. In the intervention group, trainees additionally watched a video of an experienced licensed cognitive-behavioral psychotherapist who skillfully demonstrated behavioral activation and exploring cognitive bias with an SP. Each video lasted 20 min. Participants were instructed to implement what they had read or watched during the role plays and were given 20 min to complete each role play. All role plays were video recorded.

The trainees were psychology students from the University of Potsdam with a mean age of 25.58 years (SD = 6.43); they were mainly female (81.2%) and Caucasian (81.2%). The majority were undergraduate students (84.1%) and did not have previous experience in treating mental health patients (69.6%). Participants had not received specific CBT training before. The SPs were seven undergraduate students from disciplines other than psychology. Their mean age was 22.29 years (SD = 2.14), six were female, and all were Caucasian. Prior to the role play sessions, the SPs were trained by a licensed psychotherapist and participated in a 12-hour workshop on portraying symptoms of depression. The SPs were blinded to the experimental conditions.

In contrast to the main trial, the current study focused on the reliability of the competence ratings of three rater dyads and used all available videos regardless of when they were recorded (pre/post intervention, follow-up). In total, each rater evaluated 359 videos.

Expert and Novice Raters

Three rater dyads participated in this study: the Expert, Novice A, and Novice B dyads. All raters were female and Caucasian. The characteristics of and differences between the rater dyads are shown in Table 1. All raters participated in the same study workshop and received training on the use of the instruments by two licensed therapists (FK, FW). During the workshop, the raters practiced using the rating instruments with an example video, discussed their ratings and had the opportunity to ask questions about the rating procedure and the study in general. All raters made their ratings independently. Furthermore, all raters were blinded to the study condition, the timing of the video (pre/post intervention, follow-up) and the trainee’s previous experience. It took approximately 20 min to watch and assess one video. The ratings were conducted over a period of 12 months. In sum, each rater assessed 359 videos. To ensure reliability and to counteract rater drift, all raters participated in six additional 1-h rater training sessions throughout the study. In all rater training sessions, ratings were only discussed and not changed afterward.

Table 1 Rater characteristics

Camera Perspective

The Expert and Novice B dyads watched all videos from a lateral perspective, which focused on the bodies and faces of the SPs and the study therapists from the side (i.e., standard perspective). The Novice A dyad only watched half of the videos from this standard perspective. The other half of the videos were rated by watching a screen showing three perspectives at the same time, including (a) the standard perspective described above, (b) a frontal perspective that solely focused on the SP, and (c) a frontal perspective that solely focused on the trainee therapist (see Fig. 1). The latter option provided the raters with a potentially broader range of information and details (e.g., facial reactions, empathic understanding).

Fig. 1

Camera Perspectives. Note. (a) standard perspective, (b) frontal perspective that solely focused on the SP, and (c) frontal perspective that solely focused on the trainee therapist

Measures

Counseling Skills

For the assessment of basic communication skills, the German version of the Clinical Communication Skills Scale – Short Form (CCSS-S; Maaß et al., 2022) was used. The CCSS-S is a 14-item measure using a 4-point scale ranging from 0 (not at all appropriate) to 3 (entirely adequate). An example item is “The therapist uses open questions to motivate the patient to talk”. Maaß et al. (2022) report moderate to good interrater reliabilities (ICC(2,2) = 0.65 − 0.80) and high correlations with other competence rating scales (rs = 0.86 − 0.89).

Psychotherapeutic Competence

For the assessment of psychotherapeutic competence, the German version of the Cognitive Therapy Scale (CTS; Weck et al., 2010), which combines items of the CTS (Young & Beck, 1980) and the CTS-R (Blackburn et al., 2001), was used. The German CTS is a 14-item measure using a 7-point rating scale ranging from 0 (poor) to 6 (excellent). For the current study, only 11 items were used, as three items could not be rated due to the specific task trainees had received in this study. The items used were: (2) dealing with problems/questions/objections, (3) clarity of communication, (4) pacing and efficient use of time, (5) interpersonal effectiveness, (6) resource activation, (8) using feedback and summaries, (9) guided discovery, (10) focusing on central cognitions and behavior, (11) rationale, (13) appropriate implementation of techniques, and (14) assigning homework. The internal consistency of the German CTS was good in one study (α = 0.86; Weck et al., 2010), and a previous meta-analysis reported fair to excellent interrater reliability across several studies, ICCs = 0.42 − 0.97 (Kühne et al., 2020).

Therapeutic Alliance

For the evaluation of therapeutic alliance, the German version of the Helping Alliance Questionnaire (HAQ; Bassler et al., 1995; Luborsky, 1984) was used. The HAQ is an 11-item measure using a 6-point scale ranging from 1 (strongly disagree) to 6 (strongly agree). The wording was changed to reflect that the raters had to evaluate the relationship between the therapist and patient. An example item is “I believe the patient is working together with the therapist in a joint effort”. The interrater reliability of the mean HAQ score was satisfactory in a previous study, ICC(2,2) = 0.73; p < .001 (Richtberg et al., 2016).

Empathy

For the evaluation of therapist empathy, the German version of the Empathy Scale (ES; Partschefeld et al., 2013; Persons & Burns, 1985) was used. The ES is a 10-item measure using a 4-point scale ranging from 1 (strongly disagree) to 4 (strongly agree). The wording was changed to reflect that raters had to evaluate the therapist’s behavior. An example item is “The things the therapist says and does makes the patient feel they can trust the therapist”. Partschefeld et al. (2013) report good internal consistency (α = 0.84 − 0.89) and good interrater reliability (ICCs = 0.82 − 0.85). Due to their workload, the expert raters evaluated empathy for only 200 videos.

Data Analysis

Descriptive statistics were obtained for all outcome measures. As we first aimed to investigate mean differences in competence assessments on all measures (CCSS-S, CTS, ES, and HAQ), we compared the means of all rater dyads using analyses of variance and post hoc tests (Tukey’s HSD). Regarding our second aim, to determine the IRR of novice and expert raters, we computed ICCs for the rater dyads. Following Shrout and Fleiss (1979), a two-way random effects model (absolute agreement) using the mean of two raters (ICC(2,2)) was computed. As proposed by Koo and Li (2016), ICCs between 0.50 and 0.75 were considered moderate and those of 0.75 and higher were considered good. Following these guidelines, we expected good IRR (Hypothesis 1). To compare ICCs between experts and novices, the Fisher Z-transformation was used (McGraw & Wong, 1996; Hypothesis 2). Concordance between expert and novice raters was calculated using ICCs based on mean expert and mean novice ratings (Hypothesis 3). Regarding our third aim, to examine the influence of camera perspective on IRR, the Fisher Z-transformation was used (McGraw & Wong, 1996). Fourth, we descriptively analyzed the IRR development based on the number of rated videos.
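To make the reliability analysis concrete, a minimal sketch of the ICC(2,2) computation is given below. It is illustrative only: the analyses in this study were run in SPSS, and the function name, data layout and example scores are hypothetical. The sketch implements Shrout and Fleiss’s (1979) two-way random effects, absolute agreement, average-measures coefficient for k raters, which corresponds to the ICC(2,2) reported per rater dyad when k = 2.

```python
# Illustrative sketch (not the authors' SPSS syntax): ICC(2,k) after
# Shrout & Fleiss (1979) -- two-way random effects, absolute agreement,
# average of k raters. Data and variable names are hypothetical.
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """ratings: n_videos x k_raters matrix of scores, no missing values."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    video_means = ratings.mean(axis=1)   # per rated video
    rater_means = ratings.mean(axis=0)   # per rater

    # Mean squares from the two-way ANOVA decomposition
    bms = k * np.sum((video_means - grand_mean) ** 2) / (n - 1)   # between videos
    jms = n * np.sum((rater_means - grand_mean) ** 2) / (k - 1)   # between raters
    ems = np.sum((ratings - video_means[:, None]
                  - rater_means[None, :] + grand_mean) ** 2) / ((n - 1) * (k - 1))

    # Average-measures, absolute-agreement ICC (Shrout & Fleiss, 1979)
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical CTS total scores of one rater dyad for five videos
scores = np.array([[3.1, 3.4],
                   [2.0, 2.2],
                   [4.5, 4.1],
                   [1.8, 2.5],
                   [3.6, 3.3]])
print(round(icc_2k(scores), 2))
```

In practice, an established routine (e.g., the RELIABILITY procedure in SPSS or pingouin’s intraclass_corr in Python) would additionally provide the confidence intervals reported in Table 3.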

All analyses were performed using SPSS Statistics 29 (IBM Corp, 2022). Missing data occurred only for the CCSS-S (0 to 1.7%) and were handled using listwise deletion. Power calculations indicated that, given a minimal accepted reliability of ICC = 0.60, an expected reliability of ICC = 0.75, an α of 0.05, a power (1-β) of 0.80 and k = 2 raters, a total of 114 videos per rater dyad would be necessary (Arifin, 2023).

Results

Mean Competence Ratings

First, we investigated the differences between raters’ mean competence assessments using ANOVA. The results and post hoc tests for all rater dyads are depicted in Table 2. The post hoc tests showed that the Expert raters gave significantly lower competence scores than the Novice B dyad on the CTS, CCSS-S and HAQ. The Expert raters also gave significantly lower competence scores than the Novice A dyad on the HAQ but not on the CTS or the CCSS-S. For the ES, the Expert dyad’s ratings were significantly higher than those of the Novice A and Novice B dyads. The Novice A dyad gave significantly lower competence scores than the Novice B dyad on all measures.

Interrater Reliability

Table 3 presents the IRR and confidence intervals of the rater dyads for all measures. Our first hypothesis, according to which both expert and novice raters would be able to evaluate competence with good IRR, was mostly not supported. The Expert dyad’s mean ICCs were good for the CCSS-S and moderate for the CTS, HAQ and ES. For the Novice A dyad, the mean ICCs were good for the CTS and HAQ and moderate for the CCSS-S and ES. For the Novice B dyad, mean ICCs were moderate to poor across all measures.

Table 2 Comparison of mean competence ratings
Table 3 Interrater reliabilities for expert and novice competence ratings

Our second hypothesis, according to which experts would assess more complex therapeutic competence (i.e., CTS, HAQ) but not empathy (ES) or basic counseling skills (CCSS-S) with significantly higher reliability than novices, was not supported. Contrary to our hypothesis regarding the CTS, there was no difference in IRR between the Expert and the Novice B dyad (z = 1.53, p = .063), and the ratings were significantly less reliable in the Expert dyad than in the Novice A dyad (z = -1.84, p = .033). Similarly, for the HAQ, the ratings were significantly less reliable in the Expert dyad than in the Novice A dyad (z = -3.52, p < .001). For the CCSS-S, the assessments were more reliable in the Expert dyad than in the Novice A dyad (z = 3.00, p < .001) but not more reliable than in the Novice B dyad (z = 1.50, p = .067). Contrary to our hypothesis, the difference for the ES was also significant, with higher reliability in the Expert dyad than in the Novice B dyad (z = 4.50, p < .001).

According to our third hypothesis, we expected good concordance between mean expert and novice ratings. This was supported only for the Expert and Novice A dyads. In contrast, the concordance between the Expert and Novice B dyads was only moderate (see Table 3).

Camera Perspective

To address the third aim of this study regarding the influence of the camera perspective on IRR, we compared the ICCs of all measures based on camera perspective (i.e., standard perspective vs. three perspectives) for the Novice A dyad. No significant differences were detected (all ps > 0.05; see Supplementary material 1).

Explorative Analysis

In line with our fourth aim, we explored IRR development descriptively based on the number of rated videos. The exemplary IRR development for the CTS is shown in Fig. 2. The results for the other measures are reported in Supplement 2.

Fig. 2

Development of IRR for the CTS Based on the Number of Rated Videos

The ICCs of all rater dyads varied considerably over the course of the ratings. For the Expert dyad, agreement improved after approximately 140 video ratings but decreased afterward. Agreement improved for the Novice A dyad after 140 videos and continued to improve, reaching a stable level after 200 videos. The agreement of the Novice B dyad gradually decreased with an increasing number of rated videos.
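A plausible way to produce such a trajectory (not necessarily the exact procedure used in this study) is to re-estimate a dyad’s ICC(2,2) on the first m rated videos for increasing m. The sketch below reuses the hypothetical icc_2k() helper from the Data Analysis section; the step size and the simulated ratings are arbitrary assumptions made purely for illustration.

```python
# Sketch of a cumulative IRR trajectory: re-estimate ICC(2,2) on the first m
# rated videos for growing m. Assumes icc_2k() from the Data Analysis sketch
# is in scope; 'dyad_ratings' is an assumed n_videos x 2 score matrix for one
# dyad, ordered by the sequence in which the videos were rated.
import numpy as np

def icc_trajectory(dyad_ratings, start=20, step=20):
    """Return (m, ICC(2,2) over the first m videos) for m = start, start + step, ..."""
    return [(m, icc_2k(dyad_ratings[:m]))
            for m in range(start, dyad_ratings.shape[0] + 1, step)]

# Purely simulated example (not study data): 359 videos rated by two raters
rng = np.random.default_rng(0)
true_scores = rng.normal(3.0, 1.0, size=(359, 1))
dyad_ratings = true_scores + rng.normal(0.0, 0.5, size=(359, 2))
for m, icc in icc_trajectory(dyad_ratings):
    print(m, round(icc, 2))
```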

Discussion

This study investigated the IRR of expert and novice raters for the basic counseling skills (CCSS-S), psychotherapeutic competence (CTS), empathy (ES) and therapeutic alliance (HAQ) of trainee therapists in a standardized setting. Our results must be discussed considering that measurement quality is largely influenced by rater-, instrument- and sample-related factors (Kottner et al., 2011).

First, we aimed to determine the IRR of expert and novice raters, expecting good IRR. In contrast to our hypothesis, the IRR was moderate rather than good, with the Expert and Novice A dyads achieving the highest IRR and the Novice B dyad achieving moderate to poor IRR. As the Novice B dyad did not receive supervised rater trainings but discussed ratings on their own, their insufficient IRR might be a direct result of the lack of training and supervision. In addition, it must be assumed that even highly experienced and trained raters are, to some degree, subject to several forms of errors and biases (Eckes & Jin, 2021, 2022). Rater effects such as severity/leniency bias or halo effects may thus have had an impact on our results, as may rater characteristics such as agreeableness (Ceh et al., 2022). In this way, some of our results may be due to systematic rater effects rather than differences in training or clinical experience, especially given the low number of raters involved. Furthermore, we observed that the ICCs of two rater dyads dropped over the course of the ratings. This is in line with findings showing that the reliability of ratings tends to plateau when evaluating one trainee for a certain amount of time (Kring et al., 2022). This might indicate that the quality of the ratings does not necessarily improve with higher quantity. Although the raters in the current study were advised to pause during the ratings, they may have experienced fatigue or loss of motivation due to repeatedly watching and evaluating videos with the same content (Suess & Schmiedeck, 2000). Future studies should therefore pay attention to rater effects (such as response bias and halo effects) and involve more raters to minimize these effects (Dennhag et al., 2012a, b). Additionally, future research could include other perspectives on competence assessment, such as those of supervisors or therapists themselves (Muse et al., 2022).

Second, we expected the IRR of experts on complex psychotherapeutic competences to be higher than that of novices. Interestingly, the Novice A dyad achieved significantly higher reliability than the Expert dyad for psychotherapeutic competence and therapeutic alliance. This is especially intriguing given that the experts had both greater rating expertise and more experience in conducting CBT. Our results thus indicate that rating expertise and therapeutic experience do not necessarily ensure high IRR. As our results also showed great heterogeneity in ICCs, with no rater dyad achieving consistently high reliability on all measures, aspects related to the instruments need to be discussed as well. The highest IRR across rater dyads occurred for the CTS and CCSS-S, which is in line with previous studies using comparably trained and experienced raters in standardized settings (Alpers & Hengen, 2021). Of note, our results are based only on selected items of the CTS, which may have influenced validity. Inconsistent and low ICCs appeared particularly for the ES and HAQ. For empathy assessments, the Expert dyad gave significantly higher mean ratings than the novice dyads, but the IRR of all raters was only moderate. Relatedly, the reliability of the Expert and Novice B dyads’ assessments of therapeutic alliance was only moderate. It is possible that novice raters are less familiar with overarching therapy concepts (such as therapeutic alliance), making their assessment difficult. On the other hand, low ICCs might also be due to specific characteristics of the scales, such as the ambiguous phrasing of items. This is especially true for scales or items asking raters to evaluate higher level constructs and a therapist’s general understanding of the patient (Schmidt et al., 2018). Future studies should therefore discuss raters’ understanding of the assessed concepts throughout their training and use scales with verbal anchors defining the meaning of each or selected scale points (Weck et al., 2010). Our results moreover point to the need for rating scales with more specific and observable criteria.

Third, we aimed to determine the concordance between expert and novice raters. In line with our hypothesis, concordance was good between the Expert and Novice A dyads. In contrast, concordance between the Expert and Novice B dyads was only moderate, with wide confidence intervals (see Table 3). Raters in our study were trained thoroughly and received regular training from the same two supervisors, ensuring that all raters based their ratings on the same information and guidelines. Only the Novice B dyad did not receive supervised training and discussed ratings on their own. This might explain the large difference between their mean evaluations and those of the other rater dyads; the Novice B dyad gave significantly higher competence ratings and had low concordance with the Expert dyad. While previous studies have for the most part reported including regular meetings for raters to discuss their ratings (Dittmann et al., 2017; Karterud et al., 2013; Weck et al., 2011a), we suggest that additional supervision in such meetings could further improve the rating process.

Fourth, we aimed to explore the influence of the camera perspective on IRR. We did not find significant differences based on camera perspective. These results are in line with a meta-analysis pointing out that across IRR studies on competence assessments, no moderators could be identified (Kühne et al., 2020). In fact, empirical evidence for moderators of competence ratings remains scarce (Muse et al., 2022). Although we found no effect of the camera perspective from which the rated videos were recorded, future research should continue to investigate possible moderators of IRR. Studies using large samples are needed to detect moderators such as rater training or biases.

This study also adds to the knowledge about competence assessments of trainees at an early skill acquisition stage. We thereby tapped into a particularly sensitive stage of competence development, in which reliable competence assessments are essential for further skill consolidation (Muse et al., 2022). However, the competence ratings were limited to the assessment of student trainees based on interactions in a standardized setting using well-structured and predefined tasks as well as SPs. This limits the data’s generalizability and reduces external validity. These circumstances may offer an explanation for the good IRR of the Novice A dyad as well as their high concordance with the Expert dyad. Previous studies using naturalistic samples and more experienced therapists reported lower ICCs for novices and lower concordance between experts and novices (Weck et al., 2011; Weck, Weigel, Weck et al., 2011a, b, c). While the applicability to more naturalistic settings may be criticized, such a standardized approach has proven useful, indicating that at least for adherence ratings fewer standardized assessments are needed to achieve a reliable estimate (Imel et al., 2014). Subsequent studies should therefore include competence ratings derived from naturalistic settings and more experienced therapists to complement our findings (Kazantzis et al., 2018). In practice, future research could examine how trainees respond to feedback provided by novices compared to experts.

Conclusion

In this study, we explored ways to improve competence measurements. We contributed to competence and training research by using trained independent raters, controlling for rater drift and comparing raters with different levels of expertise and training. As a strength of the current study, all raters assessed over 350 videos. To date, only a few studies have used a similar number of videos, and these indicate great variability in IRR (Barber et al., 2004; Dennhag et al., 2012a, b). Moreover, we implemented several measures for competence assessments, as any single scale is unlikely to provide a comprehensive measure of therapist competence (Muse et al., 2022). Our findings are, however, limited to the role play setting of this study using well-structured and predefined tasks as well as SPs. We suggest that trained novices such as psychology students or therapist trainees may assess psychotherapeutic competencies in standardized situations with adequate reliability, especially when supervised and using well-established, unambiguous scales. At the same time, our findings indicate that the IRR of novice raters might be highly dependent on their training and supervision throughout their ratings. While including novice raters might reduce barriers to the implementation of competence assessments, their inclusion could be limited to standardized settings or basic therapeutic tasks. In summary, this study underscores the importance and complexity of training future therapists.