Interactions with Standardized Patients to Evaluate Students' Psychotherapy Competencies: Reliable Assessment and Valid Evaluation

Abstract: The use of standardized patients (SPs) in the training of prospective practitioners is a well-established didactic tool in medical schools. Only recently have simulations of patients in psychotherapy been introduced into the training of psychologists. By integrating psychotherapy training into university-level master's programs, German law now requires licensing exams for psychotherapists (i.e., Approbationsprüfung) to include an assessment of therapeutic competencies in simulated interactions with SPs. Yet, it has not been examined whether these simulations are useful for a reliable assessment of competencies in psychotherapy trainees. Also, we need to develop standardized instruments to evaluate competencies in entry-level psychotherapists. As part of a university course, we trained master's-level students from three cohorts in clinical interviewing techniques (course title: Klinisch-psychologische Gesprächsführung). We analyzed videotaped 20-min sequences of N = 104 students while they interviewed one of N = 38 trained SPs. The students' task was to interview the SP, conduct a brief case history, and use the interviewing skills they had learned in class. Two independent raters evaluated their psychotherapeutic competencies with an adapted version of the German Cognitive Therapy Scale (CTS). Raters evaluated students' performance on two subscales and the total score with satisfactory interrater agreement (intraclass correlations). In general, students performed well in the interviews: They structured the sessions sufficiently, and their global psychotherapeutic competencies were satisfactory. However, the psychotherapeutic competencies of master's students fell short of the benchmark derived from experienced psychotherapists. This pilot study provides first evidence that simulated interviews with SPs may be a reliable tool in the assessment of practical competencies in psychotherapy trainees at an early stage of their training.
Moreover, we found that the CTS, which has demonstrated validity to quantify competencies of psychotherapists, is applicable and reliable in this training context as well. In sum, this suggests that simulated interviews with SPs may be useful for evaluating psychotherapeutic competencies of psychotherapy trainees.

Two independent raters evaluated the students' performance using the German version of the Cognitive Therapy Scale (CTS). Students' performance could be rated on two subscales and the total score with satisfactory agreement (intraclass correlations). Overall, the students conducted professionally adequate interviews with the SPs; they structured the session sufficiently, and the psychotherapeutic competencies they displayed were satisfactory. However, the students' psychotherapeutic competencies still fell clearly short of the benchmark of more experienced psychotherapists. This pilot study suggests that simulated interviews with SPs lend themselves as a reliable means of assessing practical competencies at an early stage of university-based therapy training. Moreover, we found that an adapted version of the CTS, whose validity for evaluating therapeutic competencies was already well documented, can also be applied reliably in the university training context. In sum, this shows that simulated interviews with SPs are likely to be useful for evaluating candidates' competencies as required by the new licensing exam (Approbationsprüfung).

Psychotherapy training entails learning about models of psychopathology and the theory of psychological interventions. At the same time, students need to acquire practical skills to become therapists (Vogel & Alpers, 2009). German federal law recently restructured the training of psychotherapists (Der Bundestag, 2019), the most important consequence being that training is now integrated into university-based master's programs within established psychology departments. While university exams traditionally focus on the assessment of knowledge, this reform now requires the assessment of therapeutic performance as well (Vogel & Alpers, 2009). As a logical consequence, new licensing exams (Approbationsprüfung) require the formal assessment of students' practical skills. Highly specified licensing exams include a round of interviews of each student with trained standardized patients (SPs) who simulate a realistic scenario for a clinical interview (so-called Parcourprüfung). Future licensing is thus preceded by a reliable assessment of each candidate's competencies in these clinical interviews. We are unaware of similar practical exams, since previously licensing exams focused either on the acquisition of knowledge or on case formulations. Because the didactic method of SPs is relatively new to the field and formal assessments of competencies in such interviews have yet to be evaluated, we explored whether competencies can be evaluated based on interviews with SPs.
Psychotherapy training in Germany is regulated by federal law, and for the last 20 years, it has been carried out as highly structured on-the-job postgraduate training. Recently, new legislation further prescribed integrated academic training within psychology departments (Der Bundestag, 2019). Importantly, this recent development corresponds to international training models, in particular the so-called scientist-practitioner model, which is well established in the United States (see Alpers & Hofmann, 2007). By closing the gap between research and practice, this reform also addresses the criticism that current master's programs in clinical psychology are mostly theoretically oriented (Bergold, 2008; Rief et al., 2012; Schiefele & Jacob-Ebbinghaus, 2006; Wentura et al., 2013).
Federal law now defines the relevant content of the master's programs and specifies which competencies are required to pass licensing exams after their completion. This is justified by the need to guarantee compliance with patients' safety, although graduates obtain their license much earlier than in the previous model of postgraduate training (Bundesministerium für Gesundheit, 2020). Therefore, the new master's program requires more practical training and, consequently, the assessment of each graduate's therapeutic performance (Rief et al., 2014).
Several didactic approaches have been implemented to include practical skills training in the course of psychology programs and psychotherapy training. Of course, sooner or later students should be exposed to real patients with real mental-health issues, the individuals they will treat after graduation. Interacting with patients provides the most realistic impression of the practical competencies needed to adequately perform psychotherapy. Internships and placements therefore form an integral part of any structured psychotherapy training. However, it is not always feasible to include real patients in classroom teaching because of practicability and didactic hurdles. First, enormous costs are involved for instructors and their patients; second, patients are often hesitant to participate in educational programs (see Alpers & Steiger-White, 2020).
Much easier to implement in teaching are simulations of interactions with patients. Peer-to-peer role-playing has traditionally been used to enact psychotherapy in the classroom. Indeed, role-playing has a strong track record of illustrating relevant aspects of professional interaction; it has been shown to prepare students for real-life interaction with patients with mental-health issues (Bennett-Levy et al., 2009; Fairburn & Cooper, 2011; Mitterhofer et al., 2011). In such role-playing, students alternate between the role of the patient and that of the psychotherapist. Although peer-to-peer role-playing is an important first step in practical training, it suffers from specific limitations (Alpers & Steiger-White, 2020), and it is rather challenging to simulate authentic interactions. First, while students concentrate on their therapeutic role, they are typically not trained to authentically portray the patient role. Second, patient roles and their anamnestic history are usually developed ad hoc. Third, and most importantly, peer-to-peer role-playing takes place in a familiar environment with peers of similar demographic characteristics who may know each other (in their role as a student).
More effective are external actors, who can be instructed to enact the patient role. As their roles and enactment can be well rehearsed and standardized, they are often referred to as SPs. Because they can be selected and trained to portray authentic patient roles, modules that include such SPs can offer an intermediate step between peer-to-peer role-playing and exposure to actual patients. Moreover, thanks to their high degree of standardization, they may be more suitable for exams than role-playing with other students as well as interactions with real patients.
The general approach to using SPs in training and also in testing is well established in medical schools (Barrows, 1993; Köllner et al., 2016; McNaughton et al., 2008), and this didactic method is now an integral part of medical training in Germany (Bundesministerium für Gesundheit, 2017). In medical schools, SPs are usually recruited from the general population; training procedures and role scripts have been established. This can be done so successfully that well-trained actors with medical conditions cannot be differentiated from real patients (e.g., Ortwein et al., 2006).
SPs are much less established in psychology departments and the formal training of psychotherapists. One reason may be that it is much more difficult to enact a patient with a mental disorder than someone with, say, abdominal pain. Although we acknowledge that mental disorders are difficult to portray, we argue that the challenge posed by portraying mental disorders is one of the most important reasons for utilizing well-trained actors. While traditional curricula have leaned heavily on role-playing among peers, trained actors are likely the more realistic option. Indeed, positive experiences with this didactic method are growing, and many studies have started to include it in the current postgraduate training programs for psychotherapists (Eckel, Alpers et al., 2014; Nikendei et al., 2019; Partschefeld et al., 2013). We recently implemented a teaching module with SPs in our university's curriculum for the master's in psychology to foster teaching based on the scientist-practitioner model. Our motivation was the anticipation of the revised psychotherapy training (Alpers & Steiger-White, 2020).
From the perspective of a multilevel concept of psychotherapy training, we decided to first focus on practical competencies (Vogel & Alpers, 2009), specifically on basic therapeutic techniques (Linden et al., 2007). As part of our class on therapeutic interviewing, students were first introduced to the theoretical models and then practiced basic interviewing skills in an anamnestic patient interview. At the end of the course, students were asked to use the knowledge they had acquired and what they had already practiced in peer-to-peer role-playing in a simulated interview with a trained SP. This module was experienced as particularly useful by the instructors and was very well received by the students; course evaluations were excellent and revealed great satisfaction. Most importantly, the students' psychotherapeutic self-efficacy increased over time. Interestingly, this progress was most pronounced on those dimensions of therapeutic self-efficacy that were addressed in the class. These findings provide first and promising evidence that the method of SPs is well applicable in a master's program of clinical psychology and may thus be useful for the new training format at German universities (Alpers & Steiger-White, 2020).
While there is a strong case that SPs are useful as a teaching method, it remains unclear whether interactions with SPs can be used to objectively evaluate therapeutic competencies in exams. A necessary first step in documenting its usefulness would thus be to identify measures that allow a reliable assessment of students' competencies in simulated interactions.
One of the more established tools for evaluating therapeutic competency is the Cognitive Therapy Scale (CTS; see Young & Beck, 1980, 1986). The rating scale has been used in many different settings, often in psychotherapy outcome studies, and, most importantly for our paper, it has been used to study the effects of psychotherapy training. Fortunately, there is an adapted and validated German version with very good psychometric properties (Weck et al., 2010). Although there is convincing evidence that the CTS is a useful tool to assess psychotherapeutic competencies in licensed psychotherapists (see Weck et al., 2010), it has not been well established whether the scale is also useful to evaluate performance at an earlier stage of trainees' university-level master's studies. Moreover, we are unaware of any application of the scale to evaluate interactions in simulations with SPs. However, we argue that such instruments may bear potential for the assessment in the mandatory licensing exam, which must be established for the new licensing procedure in Germany (Bundesministerium für Gesundheit, 2020).
We therefore evaluated the properties of the scale in this pilot study by coding videotaped material from our master's students' program in clinical psychology (Alpers & Steiger-White, 2020). To this end, we used the CTS to evaluate psychotherapeutic competencies in students.
Our first goal was to document the usefulness of an established scale regarding our students' performance. Only if ratings of competency can be achieved economically (practicability and time costs) would such an evaluation be reasonable for routine evaluations on a large scale. A standardized scale to evaluate therapeutic competency must be adaptable to the current setting and task. Second, we examined whether performance can be reliably measured in master's students. Third, as an indication of the validity of such ratings, we compared our students' scores with those of more experienced therapists (see Weck et al., 2016). Only if values within our sample had a certain range, and only if they plausibly differed from those obtained from more experienced psychotherapists, would we conclude that the information we obtain is meaningful for future exams.
Taken together, this information is essential to determine whether simulations of interviews with SPs are applicable in licensing exams as now required nationwide in Germany. In the face of the growing international interest in better-structured training procedures, this may contribute to the dissemination of efficient and effective training and testing procedures for much-needed therapists.

Method

Participants
The samples of SPs and the master's students were based on the sample characterized in our previous report (Alpers & Steiger-White, 2020).

Standardized Patients and Role Scripts
A group of 39 volunteers was recruited as SPs, though one had to be excluded because of unsuitable acting performance and an inadequate display of the mental disorder to be portrayed. Thus, the analyzed videotaped sessions included 38 volunteer SPs. Most were women (71.1%), and they had a wide age range with a mean of 49.9 years (SD = 19.25; range: 25-74). None of them had been previously formally trained or professionally established as actors. Most of them were affiliated with our university but not part of the psychology program; rather, they had the status either of guest students or senior students (N = 35; 92.1%). Three SPs were psychology students who had a special interest in acting but had not been previously enrolled in this course. They received 8 € compensation for each session.
All SPs completed a 2-hour introductory workshop directed by an experienced clinical psychologist to learn about the specific mental disorders they were to simulate. In their training, we emphasized improvisational acting more than what would be necessary for actors who portray a patient with, for example, a broken leg. Therefore, a professional acting coach instructed the SPs in elements of improvisation, and they practiced their roles in a 4-hour workshop under his supervision. In addition, they were provided with detailed role scripts portraying nine highly prevalent mental disorders. The role scripts included characterizations of individuals with major depression, social anxiety, specific phobias, substance abuse disorders, posttraumatic stress disorder, panic disorder and agoraphobia, obsessive-compulsive disorder as well as somatic stress disorder (Steiger-White & Alpers, 2020). On about 6-7 pages, the scripts provide information on the role's biography, case history, several characteristic statements that can be used in the interview, information about the disorder, typical body posture, and nonverbal signs. For a more extensive description of the role scripts, see Alpers and Steiger-White (2020).
Participants: Master Students and their Task
A total of 156 students took this class in three cohorts (2017, 2018, 2019). Of these, N = 122 students consented to be videotaped while they conducted the simulated clinical assessments with the SPs. The videos of 13 students had to be excluded because all videos recorded with a particular SP turned out to be unusable; 5 further videos were excluded because of poor sound quality (technical failure). Thus, a total of N = 104 videos was available for our evaluation.
All participating students were enrolled in our master's curriculum in clinical psychology at the University of Mannheim. In addition to our course, they had previously completed at least one internship in clinical psychology; most intended to pursue a professional career as a clinician.

Raters of Therapeutic Competencies
Two independent raters provided ratings on the CTS (one female and one male). One rater (KH) was an experienced psychologist with a master's degree in clinical psychology who was at an advanced stage in her psychotherapy training. The other rater was a clinically well-experienced (three clinical internships, each lasting at least 3 months) master's student (NS) who had previously participated in the course himself. The first rater (KH) instructed and trained the second rater on coding the CTS (Weck et al., 2010); after this initial calibration, the ratings were done independently. The two raters had a mean age of 27.50 (SD = 4.95).

Measuring Therapeutic Competency
We used the German version of the CTS (Weck et al., 2010) to assess psychotherapeutic competencies in the master's students' interviews. The CTS is a well-established measure to assess psychotherapeutic competencies necessary to administer cognitive behavioral therapy (CBT). It has been evaluated by psychologists in psychotherapy training as well as by practicing psychotherapists.
The scale consists of 14 items to evaluate the level of psychotherapeutic competencies on a 7-point rating scale (0 = poor, 1 = barely adequate, 2 = mediocre, 3 = satisfactory, 4 = good, 5 = very good, 6 = excellent). The 14 items cover conceptually derived specific tasks of successful CBT: (a) agenda setting, (b) dealing with problems/questions/objections, (c) clarity of communication, (d) pacing and efficient use of time, (e) interpersonal effectiveness, (f) resource activation, (g) reviewing previously set homework, (h) using feedback and summaries, (i) guided discovery, (j) focus on central cognitions and behavior, (k) rationale, (l) selecting appropriate strategies, (m) appropriate implementation of techniques, and (n) assigning homework. In addition, these subdomains can be aggregated into two subscales of psychotherapeutic competencies and a total score. Because of the better test properties, we report the averages for the subscales Session-structuring versus Global psychotherapeutic competencies. Although they do not necessarily contain the same number of items, averaging allows us to compare the subscales directly.
A few adaptations to the scale were necessary for the present work. Because our simulated interview, in the format of an anamnestic session, simulates first contact between a therapist and a patient, we decided to exclude the items "reviewing previously set homework" and "assigning homework" from our evaluation. Moreover, we also excluded the item "selecting appropriate strategies" because interviewers in our setting had no choice as to which strategies to choose from; rather, they were all limited to using "communication and interviewing techniques." Thus, for the total CTS, we evaluated the remaining 11 items. Finally, the subscale Session-structuring competencies included the items "agenda setting," "pacing and efficient use of time," "guided discovery," "focus on central cognitions and behavior," "rationale," and "appropriate implementation of techniques." The subscale Global psychotherapeutic competencies included the items "dealing with problems/questions/objections," "clarity of communication," "interpersonal effectiveness," "resource activation," and "using feedback and summaries." The interrater reliability of this adapted version is reported in the Results section.
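The scoring of the adapted scale described above can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: the item keys are hypothetical short labels for the 11 retained items, and the function simply averages the 0-6 ratings within each subscale and across all retained items.

```python
from statistics import mean

# Hypothetical short labels for the 11 retained CTS items (not from the paper).
SESSION_STRUCTURING = [
    "agenda_setting", "pacing", "guided_discovery",
    "central_cognitions", "rationale", "implementation",
]
GLOBAL_COMPETENCIES = [
    "dealing_with_problems", "clarity", "interpersonal_effectiveness",
    "resource_activation", "feedback_and_summaries",
]
ALL_ITEMS = SESSION_STRUCTURING + GLOBAL_COMPETENCIES


def score_cts(ratings):
    """Return mean scores (0-6) for both subscales and the 11-item total."""
    for item in ALL_ITEMS:
        if not 0 <= ratings[item] <= 6:
            raise ValueError(f"{item} must lie between 0 and 6")
    return {
        "session_structuring": mean(ratings[i] for i in SESSION_STRUCTURING),
        "global_competencies": mean(ratings[i] for i in GLOBAL_COMPETENCIES),
        "total": mean(ratings[i] for i in ALL_ITEMS),
    }
```

Because the subscales contain six and five items, respectively, the total mean is not the simple average of the two subscale means.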

Procedure

The Class
The interview to be evaluated for this study was part of a mandatory small-group seminar conducted in a block of two and a half days as part of the master's program in clinical psychology (Gesprächsführungsseminar). Students received course credit (two European Credit Transfer and Accumulation System points) for their participation. The general goal of the course was to teach communication and interviewing techniques and to help students deal with difficult interviewing situations. During the seminar, there were several practical units where students worked in triads to practice their communication and interviewing skills. Figure 1 illustrates the general procedure.

The SP Interviews
On the last day of the course, each student conducted a 20-minute anamnestic interview with one of the SPs, who was trained to portray one of the roles described elsewhere (Steiger-White & Alpers, 2020). While one student interviewed the SP, another student observed the interview silently to learn from observation and later provide informal feedback. Only when both the student and the SP provided informed consent were the sessions videotaped for subsequent evaluation.
The students' task was to conduct a 20-minute interview with a patient to gain enough information on the problem, the probable diagnosis, and an overview of the case history. Importantly, the SPs were instructed to closely follow their role scripts and also to simulate one of four particularly challenging behaviors: questioning the interviewer's therapeutic competence, excessive lamenting, suicidal tendencies, or striving for intimacy. These challenges were chosen based on conceptual considerations as representations of common interpersonal difficulties in psychotherapy sessions (Noyon & Heidenreich, 2013) and were categorized by one of the two independent raters.
At the end of the simulated interview, the master's students received feedback, first from the SPs, then from the observer.

Analyses
Once the general observations on the procedure and evaluation had been provided, we conducted the following statistical analyses. First, we calculated mean scores for the global scale of the CTS and the two subscales for each rater. Interrater reliability was calculated with Model 3, a two-way mixed model with an absolute agreement definition in which two raters evaluated all videotaped sessions (ICC(3,2); see Shrout & Fleiss, 1979). The calculations were carried out using version 27 of the statistical software SPSS (IBM, 2020). We used a 95% confidence interval to test for statistical significance.
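For readers who want to see how such a coefficient is obtained, the following sketch computes an average-measures intraclass correlation with an absolute agreement definition from a two-way ANOVA decomposition. It assumes the formula usually labeled ICC(A,k) by McGraw and Wong, which to our understanding corresponds to the two-way, absolute-agreement, average-measures option in common statistics packages; it is an illustration under that assumption and not the SPSS routine used in the study. The function name and input layout are hypothetical.

```python
def icc_average_absolute(ratings):
    """Average-measures ICC with absolute agreement.

    ratings: list of rows, one per rated video; each row holds one
    score per rater (e.g., [rater_1_score, rater_2_score]).
    """
    n = len(ratings)       # number of rated videos (targets)
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: targets (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Systematic rater differences (ms_cols) lower absolute agreement.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
```

With an absolute agreement definition, a rater who is consistently one point stricter than the other reduces the coefficient even if both rank the students identically; a consistency definition would ignore that offset.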
Given satisfactory interrater reliability, we planned to calculate the mean scores of the ratings of the two independent raters for the following analyses. First, we calculated Cronbach's α for the total scale and the two subscales to test the internal consistency of our mean ratings. Furthermore, we analyzed the differences between the two subscales for all three cohorts to explore any potential inconsistencies after introducing a new didactic element. We calculated a mixed ANOVA with the between factor Cohort and the within factor Scale. Significant effects were followed up with paired-sample t-tests.
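As an illustration of the internal-consistency step, Cronbach's α can be computed from item variances and the variance of the sum score. The sketch below uses the standard formula with sample variances; the function name and the row-per-student input layout are assumptions made for this example.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per student, one value per item)."""
    k = len(scores[0])  # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])  # variance of sum scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

The function expects at least two items and at least two students, and the sum scores must vary across students; otherwise the variance of the total is zero and α is undefined.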
To compare with a benchmark, we tested the global score of our students against a score obtained from published German studies of the evaluation of psychotherapeutic practitioners. For this purpose, we extracted scores of the CTS of an independent sample (Weck et al., 2014), dropping those items from their scores that were eliminated in our scales (M = 3.88, SD = 1.03). Using a t-test for two independent samples, we checked for significant differences between the psychotherapeutic competencies of the students and the more experienced practitioners.
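Because the benchmark is available only as a published mean and standard deviation, such a comparison can be run from summary statistics. The sketch below assumes the classical pooled-variance (Student) t-test; the text does not state whether equal variances were assumed, so this choice, like the function name, is an assumption for illustration.

```python
from math import sqrt


def pooled_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Student's t and degrees of freedom for two independent samples,
    computed from means, standard deviations, and sample sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df
```

In the present setting, the first three arguments would be the students' mean, standard deviation, and N = 104, and the benchmark values M = 3.88 and SD = 1.03 would be passed together with the size of the benchmark sample.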

A Descriptive Evaluation of the Rating Procedure

The Efficiency of the Training Procedure
The two raters described the task as quite demanding but doable. Because both raters were previously trained, the calibration of the rating procedure required some time. The calibration included several meetings in which the raters practiced their rating procedure on psychotherapy sessions from standardized tutorial videos. Subsequently, the raters evaluated two videotaped sessions of the master's students explaining the cognitive rationale of exposure therapy and implementing an exposure session with SPs. This stepwise procedure sufficed to guarantee that the two raters were familiar enough with the videotaped material, and that there were no profound disagreements in their interpretation of the items.

Rating Procedure
It proved feasible to evaluate many videos in an economical fashion. Specifically, it took about 50 minutes to rate each 20-minute segment of video footage. Completing the task revealed that some segments appeared to be more diagnostically relevant than others. In particular, the raters had the impression that closely observing students during a difficult interview situation was most relevant to their overall impression.

Face Validity for Students' Interviews
The items on the scales of the CTS were subjectively applicable to what was theoretically being taught by the class and displayed by the students. Overall, the items had good face validity for the evaluation of the students' performance. In addition, the two raters agreed that the scale Global psychotherapeutic competencies in particular appeared to have the best face validity. More precisely, according to the raters, "dealing with problems/questions/objections" and "using feedback and summaries" appeared to be the most relevant items to differentiate between students.

Interrater Reliability of Competency Evaluation
The internal consistency of the adapted scale was still acceptable, with a Cronbach's α of .78 for the global score (Field, 2013; Nunnally, 1978). Table 1 shows the agreement of the evaluations of the raters, who did their assessments independently of each other. All items had satisfactory to good interrater reliabilities for the ratings of the two raters, ranging from .58 to .86. Thus, we could reliably assess psychotherapeutic competencies in the master's students.

Students' Competency Scores
On a descriptive level, the students' average competency lay between a moderate and a satisfactory level of psychotherapeutic competencies at the item level (see Table 1); the two subscales and the total score were satisfactory. However, some items revealed poor performance on average, since some items may have addressed aspects to be expected in typical therapy sessions but not in the short interview task we gave our students. Importantly, the students' scores ranged broadly from 1.68 to 4.55 on the total score. The session-structuring competencies ranged from 1.25 to 4.75, and the global psychotherapeutic competencies ranged from 1.60 to 5.10. Apparently, there were no floor or ceiling effects. Figure 2 depicts the histogram of total scores.

Discussion
Students in psychotherapy training must be instructed appropriately to acquire practical skills, and their competencies must be evaluated as part of any teaching program. This is particularly relevant regarding the implementation of the new psychotherapy training in Germany, which now requires evaluations of interviews with SPs as part of the official licensing procedure.
In agreement with the well-established scientist-practitioner model (Benjamin & Baker, 2000), the university-level curricula of clinical psychology and psychotherapy have started to focus more on combining theoretical and practical aspects (i.e., according to the scientist-practitioner model, see Alpers & Hofmann, 2007). To accomplish this, the field has started to experiment with new didactic methods to simulate realistic patient encounters. While teaching with SPs has been well established in medical schools, only recently was this method included in the curricula of psychology programs (Alpers & Steiger-White, 2020).
Importantly, German legislation now requires that SPs be part of the licensing exams (Bundesministerium für Gesundheit, 2020). However, to date, it remains unclear whether psychotherapeutic competencies can be evaluated in MSc-level university students. We recently implemented the method of SPs in our curriculum of clinical psychology and psychotherapy and evaluated 104 videotaped sessions of students conducting short anamnestic interviews with an SP.
This pilot study demonstrates that short interviews with SPs may be useful for evaluating students' therapeutic competencies. First, we identified a scale that appears to be useful for rating purposes (CTS; Weck et al., 2010) and adapted it for our purposes. Its time- and cost-effective scoring makes it possible to evaluate many observations as part of a standardized exam in routine licensing procedures.
Second, the interrater agreements we observed in numerous observations show that measurements can be obtained reliably. Only this can guarantee fair and replicable grading as part of a licensure procedure. The rating procedure is feasible, and evaluators can be trained efficiently.
Third, our data provide preliminary evidence for the validity of such evaluations. There is convincing face validity that the scales developed to evaluate professional psychotherapists can be transferred and adapted to evaluate students' performance. Moreover, a wide range of differences in students' performance can be mapped by the evaluators' ratings. Most importantly, our observation that students' proficiency appears to plausibly differ from performance observed in much more experienced psychotherapists speaks to the validity of this assessment. At the same time, validity is limited because not all aspects to be expected in typical therapy sessions were part of the short interview task we gave our students. Obviously, the match between task and evaluation can be improved by more rigorous item selection.
Taken together, this information provides the essential first evidence that simulations of interviews with SPs may be applicable in student evaluation. This is particularly timely and relevant because such evaluations are now mandatory nationwide in Germany as part of the revised licensing procedures. Aside from these practical implications, this pilot study also provided us with several noteworthy observations.
Interestingly,o ur studentsa ppeared to fair better on global psychotherapeutic competencies than in arguably more basic session-structuring. Thism ay be because the primary topic of the course focused on such global psychotherapeutic competencies (e. g., dealing with problems and interpersonal effectiveness) and less on sessionstructuring competencies (e. g., agenda-setting). Although we did not examine this as af ormal ap riori hypothesis, it suggests that our studentsp erformed better on those tasks that were explicitly part of the course, which may further indicate that the evaluation can capture relevant aspects of our instruction. Of course, future research shouldexplicitlyexamine the sensitivity to improvements in performance, ideally in ap re-post design. Apart from this necessity,w ef ound evidencef or suchs pecificity in our previous evaluation of self-reported therapeutics elfefficacy ( Alpers &S teiger-White, 2020). In that study, studentsr eported more psychotherapeutic self-efficacy for the specifics kills they were actually taught. Importantly,t he present data substantiate this finding witha more objective assessment of psychotherapeutic competenciesbyexternal raters.
Standardized ratings appear to be able to capture competencies that were previously identified as important for psychotherapists (Vogel & Alpers, 2009). Future work building on these findings may identify subscales on which students' competencies are low, so that those areas may be targeted by specific instruction or practice sessions of the curriculum.
While such evaluations are not necessarily bound to the use of the specific scale we chose, we would like to discuss several observations regarding the scale's properties. It is apparently advantageous that the CTS provides a grade-related evaluation scheme. In our sample, we found a mediocre to satisfactory level of psychotherapeutic competencies in students, whom we obviously did not expect to perform as well as much more experienced practitioners. This finding may speak to the validity of the CTS to capture and evaluate psychotherapeutic competencies at all levels of proficiency. As an extension to this research, future work should also examine the scale's sensitivity to change. Could it capture students' progress from before to after a course?
Of course, there are several limitations to this pilot study that need to be considered. First, although both of our raters had a substantial level of clinical experience, they were not yet licensed themselves, a prerequisite for actual evaluations as they are legally defined as part of the licensing procedures. However, the agreement between the two raters was substantial and compares well with calibrated raters in other studies using the same scale (see Grikscheit et al., 2015; Weck et al., 2014, 2016). For licensing exams, which are legally binding, the scales first need to be evaluated with a larger number of much more extensively trained raters (see Strahl et al., 2019). Moreover, in a formal licensing procedure this would require continuous training and evaluation (see Strahl et al., 2018).
Second, for our purpose, we had to adapt the global scale of the CTS. However, we did take this into account in our comparisons between students and practitioners. Importantly, adapted versions of the CTS appear to be still functional. It is important to select those items (or to add to them) according to the specific competencies to be examined under one's model (see Linden et al., 2007; Vogel & Alpers, 2009). Also, the videos we rated were certainly not completely comparable to those used in the typical scenarios where the CTS is used. The students' interviews were not complete therapy sessions; they were obviously shorter and did not comprise all of the typical elements.
Third, in this pilot study, we did not completely control how the SPs acted out the difficult interpersonal challenge we instructed them to portray. Obviously, this resulted in more heterogeneous simulations compared to what would be required for formal licensing exams.
Fourth, the pairing of students and assigned SPs was not completely independent (38 SPs acted in simulated situations with 104 master's students). This might have resulted in a systematic bias for the evaluation of psychotherapeutic competencies. We did not control for subtle differences in difficulty, in pairings of sex, age, and other individual characteristics, most importantly the acting style of different SPs. Also, we were unable to examine order effects, which may be influenced by observing other models before a student had to perform themselves. While it might be possible to better standardize such boundary conditions at any one site, there will always be limits to the standardization across more than one assessment site.
Regarding the efficiency of such ratings, we acknowledge that there was a considerable cost and effort. Although the videotaped sessions lasted only 20 minutes, the evaluation of the psychotherapeutic competencies of 104 master's students required considerable time resources. If longer exams were to be evaluated, these costs would increase linearly.
Regarding the generalization of our observations, it is important to note that conceptualizations of what constitutes "proficient" therapeutic behavior need to be considered in any evaluation. Only a limited subset of therapeutic competencies was evaluated in this study, although there are, of course, many more aspects of professional expertise. Obviously, interviewing skills are important, albeit only one of many technical skills (Vogel & Alpers, 2009). Moreover, we did not study students' performance concerning equally important disorder-specific competencies (Linden et al., 2007), although adherence to specific manuals or a therapeutic rationale has repeatedly been proven to be relevant to outcome (e.g., Hauke et al., 2013; Weck et al., 2016). Another tradition emphasizes the importance of the therapeutic relationship (Luong et al., 2020). It remains to be evaluated whether students may vary in performance on profiles of different therapeutic skills and whether their performance results from an interaction of practical competencies with theoretical knowledge and interpersonal skills (Vogel & Alpers, 2009).
Although our teaching program was not specific to a particular therapeutic tradition, the instructors and raters were trained in CBT. Moreover, the scale we used was based on a CBT rationale and has not been extensively evaluated with therapists from other traditions. Future research is needed to explore whether similar scales can evaluate therapeutic performance rooted in other methods as well. While some basic techniques and concepts of therapeutic change clearly overlap (O'Donohue et al., 2000; Orlinsky & Howard, 1987), other aspects of therapeutic performance differ profoundly, and the specific skills to carry out basic techniques may require more specific attention (Linden et al., 2007).

Conclusion
Our observations provide a promising perspective for the evaluation of students' performance in professional interactions with SPs. Established instruments to evaluate psychotherapeutic performance can be adapted and applied efficiently and reliably for valid evaluations of student competencies.