Communication skills training for improving the communicative abilities of student social workers

Abstract Background Good communication is central to effective social work practice, helping to develop constructive working relationships and improve the outcomes of people in receipt of social work services. There is strong consensus that the teaching and learning of communication skills for social work students is an essential component of social work qualifying courses. However, the variation in communication skills training and its components is significant. There is a sizeable body of evidence relating to communication skills training; therefore, a review of the findings helps to clarify what we know about this important topic in social work education. We conducted this systematic review to determine whether communication skills training for social work students works and which types of communication skills training, if any, were most effective and led to the most positive outcomes. Objectives This systematic review aimed to critically evaluate all studies which have investigated the effectiveness of communication skills training programmes for social work students. The research question which the review posed is: ‘What is the effectiveness of communication skills training for improving the communicative abilities of social work students?’ It was intended that the review would provide a robust evaluation of communication skills training for social work students and help explain variations in practice, to support educators and policy‐makers to make evidence‐based decisions in social work education, practice and policy. Search Methods We conducted a search for published and unpublished studies using a comprehensive search strategy that included multiple electronic databases, research registers, grey literature sources, and reference lists of prior reviews and relevant studies.
Selection Criteria Study selection was based on the following characteristics: Participants were social work students on generic (as opposed to client specific) qualifying courses; Interventions included any form of communication skills training; Eligible studies were required to have an appropriate comparator, such as no intervention or an alternative intervention; and Outcomes included changes in knowledge, attitudes, skills and behaviours. Study selection was not restricted by geography, language, publication date or publication type. Data Collection and Analysis The search strategy was developed using the terms featuring in existing knowledge and practice reviews and in consultation with social work researchers, academics and the review advisory panel, to ensure that a broad range of terminology was included. One reviewer conducted the database searches, removing duplicates and irrelevant records, after which each record was screened by title and abstract by both reviewers to ensure robustness. Any studies deemed to be potentially eligible were retrieved in full text and screened by both reviewers. Main Results Fifteen studies met the inclusion criteria. Overall, findings indicate that communication skills, including empathy, can be learnt, and that the systematic training of social work students results in some identifiable improvements in their communication skills. However, the evidence is dated, methodological rigour is weak, risk of bias is moderate to high/serious or incomplete, and extreme heterogeneity exists between the primary studies and the interventions they evaluated. As a result, data from the included studies were incomplete, inconsistent, and lacked validity, limiting the findings of this review, whilst indicating that further research is required. Authors’ Conclusions This review aimed to examine the effects of communication skills training on a range of outcomes in social work education.
With the exception of skill acquisition, there was insufficient evidence available to offer firm conclusions on other outcomes. For social work educators, our understanding of how communication skills and empathy are taught and learnt remains limited, due to a lack of empirical research and comprehensive discussion. Despite the limitations and variations in educational culture, the findings are still useful, and suggest that communication skills training is likely to be beneficial. One important implication for practice appears to be that the teaching and learning of communication skills in social work education should provide opportunities for students to practise skills in a simulated (or real) environment. For researchers, it is clear that further rigorous research is required. This should include the use of validated research measures and research designs with appropriate counterfactuals, alongside more careful and consistent reporting. The development of the theoretical underpinnings of the interventions used for the teaching and learning of communication skills in social work education is another area that researchers should address.

Doing so will form a reliable, scientifically rigorous, and accessible account that can be used by educators and policy-makers to guide decisions about which approaches are effective in teaching communication skills to social work students. In this time of political uncertainty and financial constraint, 'it is important to accumulate evidence of the outcomes of social work education so that policy-makers and the public can be confident that it is producing high-quality social workers' (Carpenter, 2016, p. 192), who are suitably equipped to deal with the demands of social work practice. We conducted this systematic review to determine whether CST for social work students works and which types of CST, if any, were the most effective and led to the most positive outcomes. To improve uptake and relevance, the systematic review was developed in consultation with stakeholders (including academics, students, practitioners, and people with lived experience) and advice was sought from leading social work organisations. The review also sheds light on areas where more research is required.

| OBJECTIVES
This systematic review aimed to critically evaluate all studies which have investigated the effectiveness of CST programmes for social work students. The PICO (Population, Intervention, Comparator, Outcomes) framework and stakeholder collaboration informed the development of the research question. Student social workers constituted the population, CST was the intervention under investigation, the absence of CST or a course unrelated to communication were the comparators, and attitudes, knowledge, confidence and behavioural changes were the outcomes of interest. Stakeholders had agreed that neither the comparator nor the outcomes should be specified within the research question itself, on the grounds that researchers and academics were unlikely to have specified these elements in the primary studies. The review built on an existing knowledge review (conducted by Trevithick et al., 2004) but was not restricted by year of publication or language. The research question which the review posed is: 'What is the effectiveness of CST for improving the communicative abilities of social work students?' It was intended that the review would provide a robust evaluation of CST for social work students and explain variations in practice. To test the effectiveness of interventions, hierarchies of evidence point to systematic reviews of (preferably randomised) controlled trials. Therefore, we sought to conduct a rigorous and systematic review of such studies about CST, supporting educators and policy-makers to make evidence-based decisions in social work education, practice and policy. The studies were required to include an appropriate comparator to be eligible for inclusion in the review, irrespective of whether outcome data were reported in a useable way. Permitted study designs included: randomised trials, non-randomised trials, controlled before-after studies, repeated measures studies and interrupted time series studies.
To be included, interrupted time series studies needed a clearly defined point in time when the intervention occurred and at least three data points before and three after the intervention. The justification for this wider range of study types was to identify any potential risk of harm which we hoped to assess through wider evidence. Potential risk of harm included any negative effects of CST on students' communicative abilities, for example, service users and carers might have indicated that students' poor communication left them feeling more confused, agitated, misunderstood or distressed (i.e., worse) than they did before the interaction.
To ensure quality of evaluation, all studies were critically appraised and an analysis of the results by study design was considered. Comparison groups were composed of those who received no educational intervention and those who received educational interventions other than CST. Trials comparing the effects of two different educational interventions to improve communication skills were also included in this review. In accordance with Campbell policies and guidelines (The Campbell Collaboration, 2014), studies without comparison groups or appropriate counterfactual conditions were excluded.

| Types of participants
All social work students who were taught communication skills on a generic qualifying social work course in a university setting were included; hence both undergraduate and postgraduate students were eligible participants. Students on post-qualifying courses were excluded.

| Types of interventions
Only studies in which the intervention group received CST and in which the control group received nothing or received an alternative training to the intervention group were included. For the intervention, any underpinning theoretical model and any mode of teaching (taught input, videotape recording, role-play with peers, simulated interviews with service users and carers or actors) were considered acceptable. Interventions that took place either entirely or predominantly in a university setting were included.

| Types of outcome measures
Outcomes included changes in (1) knowledge, (2) attitudes, (3) confidence/self-efficacy and (4) behaviours measured using objective and subjective scales. It was anticipated that these measures might be study-specific rating scales, developed for use in evaluating communication skills. Stakeholder involvement indicated that behavioural change was an important outcome for all stakeholders. In addition, students and educators deemed confidence/self-efficacy to be a relevant outcome. In keeping with the literature on outcomes in social work education (Carpenter, 2005, 2011), student satisfaction alone was not considered as an outcome measure in this review.

| Search methods for identification of studies
We conducted a search for published and unpublished studies using a comprehensive search strategy informed by the guide to information retrieval for Campbell systematic reviews (Kugley et al., 2017). We also sought advice from information specialists. Our search strategy included searching multiple electronic databases, research registers, grey literature sources, and reference lists of prior reviews and relevant studies. Study selection was not restricted by geography, language, publication date or publication status. The original search took place in September 2019 and an updated search took place in June 2021.

| Electronic searches
To identify eligible studies, multiple data sources were searched using the search strings set out in Supporting Information: Appendix A.

| Data collection and analysis
We collected and analysed data according to our protocol (Reith-Hall & Montgomery, 2019).
One reviewer (ERH) conducted the database searches, removing duplicates and irrelevant records. As we had anticipated that the searches would result in relatively few records to screen, each record was screened by title and abstract by both reviewers (ERH and PM) to ensure robustness. Any studies deemed to be potentially eligible were retrieved in full text and screened by both reviewers. There were no disagreements; hence discussion with an arbitrator was not required and consensus was reached in all cases. The search strategy was developed using the terms featuring in existing knowledge and practice reviews and in consultation with social work researchers and academics, to ensure that a broad range of terminology was included. Search strings included terms relating to the intervention and population but not study design. A sample search strategy for Medline can be found in Supporting Information: Appendix A. Search strings and search limits were modified for each database. Proximity searching was not required.

| Data extraction and management
Once eligible studies were found, an initial analysis of intervention descriptions was undertaken for each. The Campbell data collection template form was used to identify the core components of programmes and to develop an overarching typology and coding frame.

Details of study coding categories
Components included:
• Duration and intensity of the programme.
• Whether programme delivery included people with lived experience (e.g., service users and carers).
• Whether programmes used audio and video recording.
• Study characteristics in relation to design, sample sizes, measures and attrition rates.
• Whether the study was conducted by a research team associated with the programme or an independent team.
• Stage of programme development, for example whether it was a new programme being piloted or an established programme being replicated.
• Participants' characteristics in relation to age, sex, ethnicity, geopolitical region and socio-economic background.
We considered subgrouping the different types of intervention and population, based on factors such as length of course, teaching methods, age and sex; however, the small number of included studies did not warrant subgroup analysis.
Coding was carried out by the review team independently; discrepancies were discussed, and a consensus reached.
Quantitative data were extracted to allow for the calculation of effect sizes (using mean change scores and post-test means and standard deviations). Data were extracted for the intervention and control groups on the relevant outcomes measured to assess the intervention effects.

| Assessment of risk of bias in included studies
Assessment of methodological quality and potential for bias was conducted using the Cochrane Risk of Bias tool for randomised studies and the ROBINS-I tool for non-randomised studies (Sterne, Hernán, et al., 2016).

| Measures of treatment effect
Continuous outcomes were reported by the included studies, so we used the standardised mean difference (SMD) as our effect size metric where means and standard deviations were provided by study authors. Where means and standard deviations were not available, we calculated SMDs from t-values, and calculated standard deviations from standard errors where these were provided, using recommended methods (Higgins et al., 2022). Hedges' g was used for estimating SMDs to correct for the bias associated with small sample sizes. In studies with more than two groups, we calculated effect sizes using the experimental and control groups that were most relevant to answering our research question, or used data from the groups with the largest numbers of participants.
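To make these calculations concrete, the standard independent-groups formulas can be sketched as follows (a minimal illustration of the conversions described above; the function names are ours, not from the review or any named software):

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    # Pooled standard deviation across the intervention and control groups
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    # Cohen's d from group means, then Hedges' small-sample correction J
    d = (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)  # bias correction for small samples
    return d * j

def d_from_t(t, n1, n2):
    # SMD recovered from an independent-samples t statistic
    return t * math.sqrt(1 / n1 + 1 / n2)

def sd_from_se(se, n):
    # Standard deviation recovered from a reported standard error
    return se * math.sqrt(n)
```

For example, with two groups of 20 students, means of 10 and 8, and a common standard deviation of 2, Cohen's d is 1.0 and Hedges' g shrinks this slightly (by the factor J) to reflect the small sample.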

Treatment of qualitative research
This systematic review was limited to synthesising the available evidence on the effectiveness of CST for social work students. It was beyond the remit of the present review to synthesise the associated evidence from process evaluations of such programmes; hence we did not include qualitative research.

| Unit of analysis issues
The unit of analysis for this review was social work students. No unit of analysis issues were identified for the included studies.

| Dealing with missing data
Study authors were contacted and accompanying or linked papers were sought in an effort to retrieve missing data.

| Assessment of reporting biases
Reporting was generally poor among the included studies, as evidenced by the limited use of reporting instruments such as CONSORT; no references to pre-published protocols were made by study authors. A more detailed discussion of this issue can be found in the Risk of Bias section. Use of a funnel plot, which helps to identify potential reporting bias in the included studies, was not feasible given the small number of studies included in this review.
The use of a highly sensitive and inclusive systematic search of bibliographic databases, grey literature sources, reference list searching, correspondence with study authors and hand searching sought to counteract potential bias in our reporting of this review.

| Data synthesis
As a result of the extreme heterogeneity between the included studies and their interventions, meta-analysis was not feasible, nor was it possible to implement methods outlined in the protocol, such as sensitivity and subgroup analysis. I² and Tau² were not measured or reported in this review. Similarly, we were unable to use the new GRADE Guidance for Complex Interventions (unpublished) to summarise the overall quality of evidence relating to the primary outcomes.

| Subgroup analysis and investigation of heterogeneity
N/a in view of there being no meta-analysis.

| Sensitivity analysis
N/a in view of there being no meta-analysis.
| Summary of findings and assessment of the certainty of the evidence
N/a

| RESULTS

| Description of studies
Fifteen studies are included in this review. An overview of the key characteristics of the included studies, described in terms of study design, participants, interventions, comparators, outcomes, outcome measures, geographical location, publication status and implementation factors, is provided in Table 1.

| Results of the search
The main bibliographic database and registers search, completed in September 2019, returned 1998 records with an additional 12 added after the search was updated in June 2021. After 882 duplicate records were removed, 1128 were subjected to initial screening by title, and abstract if necessary, following which a further 1021 records were removed because they were not relevant to the topic.
Of the 107 remaining records, 2 could not be retrieved despite endeavours to locate them through different libraries and searches, therefore 105 records were fully screened for eligibility, 9 of which met the inclusion criteria.
Another 650 studies were identified through recent editions of five key journals identified through the database search. A further 19 studies were identified through other methods including citation searching within the included studies. Of the 669 studies subjected to initial screening, 627 were removed because they were not relevant to the topic. One record could not be retrieved resulting in 41 records being fully screened for eligibility, of which 34 records were excluded, and 7 records (reporting 6 studies) were included.
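As a check on internal consistency, the record counts reported above tally; the arithmetic can be laid out as a simple illustrative calculation (the variable names are ours):

```python
# Database and register searches (September 2019 search plus June 2021 update)
records_identified = 1998 + 12            # records returned in total
records_screened = records_identified - 882   # after duplicates removed
remaining = records_screened - 1021       # after irrelevant records excluded
full_texts = remaining - 2                # two reports could not be retrieved
included_from_databases = 9               # studies meeting the inclusion criteria

# Journal hand-searching and citation searching
other_records = 650 + 19                  # key journals plus other methods
fully_screened = other_records - 627 - 1  # irrelevant records and one unretrievable
included_records = fully_screened - 34    # 7 records, reporting 6 studies

print(records_screened, remaining, full_texts, fully_screened, included_records)
```

Nine studies from the database search plus six studies from other methods give the fifteen studies included in the review.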
Of the fifteen studies which met the inclusion criteria for this systematic review, two experiments are reported in a single paper (Barber, 1988), one study is reported in two papers (Greeno et al., 2017; Pecukonis et al., 2016), with both authors contributing to the write-up of each, and another study (Larsen & Hepworth, 1978) is also written up as the first author's PhD thesis (Larsen, 1975). The majority of the studies (Barber, 1988; Collins, 1984; Keefe, 1979; Ouellette et al., 2006; Rawlings, 2008; Toseland & Spielberg, 1982; VanCleave, 2007; Vinton & Harrington, 1994) employed a case-controlled design, some of which conform to the parameters of a pre-experimental static group comparison design (Campbell & Stanley, 1963). This means that participants were divided between two groups, but in a non-randomised way. Given that students were not randomised to the different groups, these studies suffer from weak internal validity, with confounders such as maturation, the Hawthorne effect, testing effects and pre-existing differences between the intervention and control groups. Such issues are common in educational research.
Six of the studies, reported in seven papers, were randomised controlled trials (RCTs), five of which were conducted in the mid to late 1970s. The increase in research activity surrounding this topic during this decade likely results from the development of teaching models such as Ivey and Authier's micro-counselling model (Ivey & Authier, 1971; Ivey et al., 1968) and the Truax and Carkhuff Human Relations training model (Carkhuff, 1969c; Truax & Carkhuff, 1967), alongside the development of research measures, including the Carkhuff scales (Carkhuff, 1969a, 1969b), which are the most cited research instrument in this review. Wells (1976) is the earliest of the included studies to use an RCT design, comparing role-play and students' 'own problem' procedures, but the sample contained just 14 students. Hettinga (1978) had a somewhat larger sample of 38 students, in which immediate feedback from an instructor was compared with group feedback provided later.
Although quasi-randomisation took place, it is unlikely the allocation method affected the results. In the same year, Larsen and Hepworth (1978) investigated the role of experiential learning; controls received traditional didactic instruction. Schinke et al. (1978) randomly allocated a group of 23 students to either an intervention group or a waiting-list control. Laughlin (1978) used a more complex design consisting of two experimental groups and two control groups.
Despite using pre-tests, a strategy which can help overcome methodological challenges associated with small sample sizes (social work cohorts are typically small), the study was hopelessly underpowered. The most recent of the included studies, reported in the two papers by Pecukonis et al. (2016) and Greeno et al. (2017), offers the most robust research design of the included studies. Not only did they exceed the minimum sample size calculated in an a priori power analysis, but the overall risk of bias was lower than other studies included in this review.
In terms of comparators, in four of the studies the control group received no intervention (Barber, 1988; Rawlings, 2008; Toseland & Spielberg, 1982); three studies reported controls receiving treatment as usual (TAU) (Greeno et al., 2017; Keefe, 1979; Pecukonis et al., 2016; VanCleave, 2007), although the TAU in Greeno and Pecukonis' study was an online intervention rather than an absence of intervention; and a further five studies compared two different interventions. These included an experiential approach versus traditional didactic learning (Larsen & Hepworth, 1978); lab-based versus lecture-based training (Collins, 1984); online versus classroom-based teaching (Ouellette et al., 2006); videotaped interview playback with instructional feedback versus peer group feedback (Hettinga, 1978); and role-play versus students' 'own problems' procedures (Wells, 1976). In a rather complex design, Laughlin's (1978) study included two treatment arms and two control groups, one of which received no treatment. In two further studies (Schinke et al., 1978; Vinton & Harrington, 1994), the controls had a delayed start (operating as a waiting-list procedure).
Significant issues with measurement are evident within the included studies and are acknowledged by several of the researchers (Collins, 1984;Greeno et al., 2017;Laughlin, 1978;Vinton & Harrington, 1994). Methodological challenges will be considered in Section 6.
Sex. Ten of the included studies report on the number and percentage of men and women in the student samples. In Collins' (1984) study, of the 54 students in the lab group, 17% (N = 9) were men; however, of the 13 students in the lecture group sample, 46% (N = 6) were men, an unusually high proportion. Collins (1984, p. 74) acknowledges that this is not explained by the admissions procedures at either of the universities involved in the study. However, it must be remembered that the 13 students from the lecture group, who volunteered to be part of the study, are not necessarily representative of the cohort demographic.
Age. Due to differences in reporting practices, the age characteristics of the students in the included studies are harder to compare. In five studies (Barber, 1988, experiment 2; Keefe, 1979; Larsen & Hepworth, 1978; Toseland & Spielberg, 1982; Wells, 1976), age characteristics were not reported.
Ethnicity. One study identified that 78% of students (N = 25) were Caucasian, almost 10% (N = 3) were Hispanic, just over 6% (N = 2) were Biracial, 3% (N = 1) were African American and 3% (N = 1) were defined as 'Other'. In the study reported by Pecukonis et al. (2016) and Greeno et al. (2017), just over 51% (N = 28) of students were Caucasian, 45% (N = 24) were Black and almost 4% (N = 2) were Hispanic. In VanCleave's (2007) study, over 95% (N = 43) of students were Caucasian, one student was African American and one was Japanese, each accounting for just over 2%. The earlier studies did not report on the ethnicities of their participants, reflecting changes to trends in the collection of demographic data.
Data is absent for other demographic characteristics within the included studies.

Location characteristics
There is little variation within the geo-political contexts in which the included studies were conducted. This is important because it reflects some priorities such as the primacy placed on experimental design, at the expense of others, including stakeholder involvement. One study, Collins (1984) (N = 67) was undertaken in Toronto, Canada, whilst Barber (1988) reports on two experiments conducted in Victoria, Australia (N = 82). One study, Toseland and Spielberg (1982) did not provide a location (N = 68). The remaining 11 studies were carried out in different US states, where the focus on evidence-based teaching and learning in social work education is firmly established. Involvement and participation from people with lived experience was noticeably absent-the second of the Barber (1988) experiments and the client interviews in Collins' (1984) study being the exceptions.
None of the included studies were conducted in the UK, where a strong tradition of service user and carer involvement in social work education prevails, which arguably explains, but does not justify, the omission of contributions from people with lived experience within the body of research identified in this review.

Intervention characteristics
Theoretical orientation. Experiential learning is referred to in the majority of the studies (Collins, 1984; Greeno et al., 2017; Keefe, 1979; Larsen & Hepworth, 1978; Laughlin, 1978; Pecukonis et al., 2016; Rawlings, 2008; Schinke et al., 1978; Toseland & Spielberg, 1982) as the underpinning theoretical orientation of the intervention under investigation. However, the term is not applied consistently. With its wide range of different meanings, ideologies, methods and practices, experiential learning is conceptually complex and difficult to define (Moon, 1999). Conceptualisations arising from two different traditions are evident within the included studies: first, the work of Carkhuff and Truax (1965) and Ivey and Authier (1971), which derives from psychotherapy, and second, the work of Kolb (1984) and Schön (1987), which is grounded in a constructivist view of education and has been particularly instrumental within professional courses.
Although deriving from psychotherapy, the microskills counselling approach developed by Ivey et al. (1968) and Ivey and Authier (1971) has informed the teaching of interviewing skills in social work education. Content comprises well-defined counselling skills including attending behaviour, minimal activity responses, and verbal following behaviour. Six of the included studies made reference to the work of Ivey and colleagues, however five of them (Collins, 1984;Hettinga, 1978;Laughlin, 1978;Rawlings, 2008;VanCleave, 2007) did so simply within a discussion of the wider literature. It is only in Schinke et al.'s (1978) study where Ivey's work has a direct impact on the empirical evaluation itself; an adapted version of the Counsellor Effectiveness Scale developed by Ivey and Authier (1971) was used as one of the study's measuring instruments.
Referred to as the Human Relations training model, the work of Carkhuff and Truax (1965) and Carkhuff (1969c) has been more influential than Ivey's approach. A brief exploration of empathy as a theoretical construct helps to explain why Carkhuff and Truax's work has influenced social work education and practice. Whilst linguistic relevance can be seen in the Greek word 'empatheia', which means appreciation of another's pain, the philosophical underpinnings of the term empathy actually derive from the German word Einfühlung. Seven of the included studies (Collins, 1984; Larsen & Hepworth, 1978; Laughlin, 1978; Toseland & Spielberg, 1982; VanCleave, 2007; Vinton & Harrington, 1994; Wells, 1976) used the Carkhuff scales (Carkhuff, 1969a, 1969b) as an outcome measure in their empirical research. As identified by Elliott et al. (2018), the Carkhuff scales were some of the earliest observer measures, which may well explain the popularity of this instrument. The focus the researchers of the included studies placed on empathy is striking and will be considered further in subsequent sections.
Also apparent in the literature is the experiential learning approach deriving from the experiential learning cycle developed by Kolb (1984) and the concept of reflective practice articulated by Schön (1987). Rawlings (2008), who provides the most comprehensive overview of experiential learning in the included studies, draws on the work of both. Huerta-Wong and Schoech (2010) drew on the work of Bandura (1971), recognising that the modelling of skills is important for learning. Ideas about self-reinforcement (Bandura, 1976) influenced Laughlin (1978), in a consideration of the impact of internal and external motivation. The exploration by Rawlings (2008) into the role of self-efficacy in skill development was informed by self-efficacy and social cognitive theory (Bandura, 1997). Behaviour, according to social cognitive theory, is influenced by goals, outcome expectations, self-efficacy expectations and socio-structural determinants (Bandura, 1982). Much of the literature indicates the potential impact of students' self-efficacy beliefs on the teaching and learning of communication skills in social work education.
Irrespective of which conceptualisation is used, the value of experiential learning has withstood the test of time, and it remains the leading theoretical orientation underpinning the teaching and learning of CST, or of specific components of it, both of which are addressed in this review. Toseland and Spielberg (1982) consider experiential learning fundamental to the systematic training that the teaching of communication skills requires. In a review of the practice of teaching and learning of communication skills in social work education in England, Dinham (2006) identified a strong emphasis on experiential and participative teaching and learning methods.
Other theories, for example ego psychology in Hettinga (1978), are discussed, particularly in the dissertation theses; however, the theoretical orientations underpinning the pedagogical approaches are largely ill-defined or absent from the outcome studies in this review.
Delivery and approach. The included studies do provide some insight into the delivery format and teaching methods under investigation, especially where studies compare teaching modalities or approaches.
A central concern in the earlier studies is whether practising skills in communication and empathy (utilising an experiential component) is more effective than a purely didactic, traditional lecture-based approach. Larsen and Hepworth (1978) compared the efficacy of a traditional didactic intervention with an experiential intervention used within communication laboratories. Collins (1984) also compared a lecture-based training course with a skills lab training course.
The results of these studies supported practice-based experiential learning. By contrast, when Keefe (1979) compared an experiential-didactic course and a structured meditation experience with a control group, the experiential group did not make the expected gains, whereas those receiving meditation did. In an extension of the basic design, Keefe (1979) found a combination of experiential training and structured meditation proved most effective.
Some of the more current studies focussed on classroom-based teaching versus online delivery, an issue particularly relevant in the current global pandemic, which in many instances has seen teaching move to purely online or blended delivery. Ouellette et al. (2006) compared a classroom-based instructional approach with an online web-based instructional approach and found no significant differences between the two. In the study reported by Greeno et al. (2017) and Pecukonis et al. (2016), however, live supervision with standardised clients compared favourably with the treatment as usual (TAU), which they describe as online self-study.
Other studies compared more specific components within the intervention. The role of active learning for students was important, whether that included participation in role-play with peers or with simulated clients. Wells (1976), in comparing the use of role-play with the use of participants' own problems, found that neither proved preferable but identified the active experimentation of students as being the key factor in their interpersonal skills development.
The role of the instructor was also an issue of interest. Hettinga (1978) examined the benefits of 1:1 instructor feedback compared with small group feedback, Laughlin (1978) focused on the role of instructor feedback versus self-evaluation, whilst Greeno et al. (2017) and Pecukonis et al. (2016) expressed optimism for the use of live supervision. Again, whilst no claim can be made about who the feedback provider (self, peers or instructor) should be, active engagement with the evaluation and feedback process seems to be the underlying mechanism which facilitates change. Opportunities for playback were another area for investigation. Reflecting the rapid development of technology in recent years, Laughlin (1978) investigated the use of audiotapes, whereas Vinton and Harrington (1994) used videotapes.

Some of the authors included in this review are confident in recommending specific teaching methods. Toseland and Spielberg (1982) suggest practice, feedback and modelling are necessary; Schinke et al. (1978) add role playing, cueing, and positive reinforcement to this list. Greeno et al.'s (2017) advice to educators is similar, with the added recommendation of supervision. Pecukonis et al. (2016) highlighted modelling of techniques to students as key. In a review of empathy training in which meta-analysis was feasible, Teding van Berkhout and Malouff (2015) suggest that studies in which behavioural skills were developed through instruction, modelling, practice and feedback had higher, but not significantly higher, effect sizes than those in which some or all of these components were missing. Findings from qualitative research indicate that students learn communication and interviewing skills through the practice, observation, feedback and reflection that accompany simulation and role-play activities, which Banach et al. (2020) found mapped onto Kolb's (1984) model of experiential learning. Further exploration of these issues is required.
Implementation factors: Amount, duration and uptake. Considerable variation in terms of amount and duration is evident across the included studies. The briefest intervention was a single 4-h training session (Schinke et al., 1978), whilst the longest intervention, described as 'extensive', appears to be interspersed throughout a 4-year degree course (Barber, 1988). The literature has documented the ability to teach empathy at a minimally facilitative level in as few as 10 h (Carkhuff, 1969c; Carkhuff & Berenson, 1976; Truax & Carkhuff, 1967). Indeed, Larsen and Hepworth (1978) found positive change occurred from a 10-h intervention, but 'estimated that 20 h, preferably 2 h per week for 10 weeks, would be ample' (p. 79).
However, Toseland and Spielberg (1982) suggested that the course under investigation in their study, which lasted approximately 45 h (30 h of which were experiential learning in a laboratory) may not be sufficient to increase students' skill to the level of competence expected of a professional worker. In the study undertaken by VanCleave (2007), implementation of the intervention appeared to vary between students, because 'when assignment by cohort could not be achieved, training was subdivided into smaller groups. Given the flexibility of the researcher, individual training was accommodated' (p. 119). It is likely this variation occurred to enhance student participation in the study, maximising data collection opportunities for research purposes.
A number of studies did not report details regarding the amount and duration of the intervention, and some provided rather vague or imprecise details, rendering comparative aims regarding amount and duration of training futile.
The studies focus on what was taught, but data on uptake are sorely lacking. Some of the included studies (Collins, 1984; Larsen & Hepworth, 1978; Ouellette et al., 2006) compared students' personal and demographic characteristics alongside their pre-course training and/or experience. The roles of sex, age and pre-course experience were key considerations. Social work courses attract few men compared to women, and often have small cohorts, making judgements on demographic characteristics difficult. Vinton and Harrington (1994), who examined the impact of sex on students' empathy levels, found women had higher QMEE scores than men at both pre- and post-test. This is consistent with a study undertaken by Zaleski (2016), which found that female students in medicine, dentistry, nursing, pharmacy, veterinary science, and law had higher levels of empathy than their male peers.
Counterintuitively, age was not found to be significantly correlated with communication skills. Ouellette et al. (2006) queried whether age was a factor in learning, yet 'summary statements' was the only item on their interview rating scale found to be significantly correlated with age. Collins (1984) found that the amount of prior training had no impact on students' ability to demonstrate interpersonal skills. Similarly, in a comparison of the mean levels achieved by groups dichotomised on the basis of age, sex, previous social work experience, and undergraduate social welfare or other major, Larsen and Hepworth (1978) found such attributes yielded no significant differences on either pre- or post-test scores. Both studies challenge the assumption that students with more social care experience before training possess more or better communication skills than those without. In terms of uptake, Larsen and Hepworth (1978, p. 78) suggested that 'a mix with contrasting skill levels appears advantageous', because 'students with higher-level skills modelled facilitative responses in the practice sessions for students with lower skills, thus encouraging and assisting the latter to achieve higher levels of responding'. In the study conducted by Laughlin (1978), self-instruction students exhibited significantly higher mean scores for enjoyment and number of optional practice items completed than students in an instructor-led group. Self-instruction 'creates a sense of self-reliance, confidence, and personal responsibility for learning which promotes enjoyment and devotion to task not present under circumstances of external control' (Laughlin, 1978, p. 67). Self-instruction appears to facilitate uptake. Other issues affecting student learning, such as concentration or care-giving responsibilities, and their impact on uptake were not addressed in any of the studies included in this review.

| Risk of bias in included studies
Both review authors assessed the risk of bias of the included studies, independently applying the risk-of-bias tools: RoB 2 (Sterne et al., 2019) for the randomised trials and ROBINS-I (Sterne, Hernán, et al., 2016) for the non-randomised studies of interventions. Both tools comprise a set of bias domains intended to cover all issues that might lead to a risk of bias. We used the Methodological Expectations of Cochrane Intervention Reviews (MECIR) guidance, the revised Cochrane risk-of-bias tool for randomised trials (RoB 2) and the detailed guidance for the Risk of Bias in Non-randomised Studies of Interventions (ROBINS-I) to inform our judgements. To answer the review's research question, we were interested in assessing the effect of assignment to the intervention, as opposed to adherence to the intervention. Discrepancies between review author judgements were resolved through discussion.
Both reviewers judged there to be a moderate or high/serious risk of bias in all but three of the 15 included studies: only one study received a low risk of bias rating overall, and a further two received a low rating for one outcome measure but not the other. The lack of information for certain domains was a problem in all of the studies, highlighting that in future, researchers should report a greater level of detail to enable the risk of bias to be fully assessed. Using a tool such as CONSORT-SPI (Grant et al., 2018) would facilitate this.

| Risk of bias in randomised trials
As shown in Table 3, there was considerable variation within the risk of bias domains of the randomised trials. Only one study was rated as low risk of bias, one was rated as having 'some concerns', three were rated as being at high risk of bias and one study (reported in two papers) received a mix of overall bias ratings, according to the outcomes measured. Limitations were evident in all of the studies, including the lack of information reported in domains 2 and 5.
Domain 1: Bias arising from the randomisation process
Randomisation aims to avoid an influence of either known or unknown prognostic factors. There was considerable variation in the information provided by the study authors regarding the randomisation process.
Where there was sufficient information about the method of recruitment and allocation to suggest the groups were comparable with respect to prognostic factors (Hettinga, 1978; Larsen & Hepworth, 1978; Laughlin, 1978), the risk of bias was considered low. This level of detail is provided by Laughlin (1978): a table of random numbers ensured allocation sequence generation; manila envelopes were used for allocation sequence concealment; and potential prognostic factors such as age, prior job and training experience were measured as equivalent for all groups at the outset.
Conversely, information required for RoB 2 was missing from the other studies, some of which was gleaned by directly contacting study authors. Elizabeth Greeno provided additional details about the randomisation process, enabling the risk of bias in the study reported by Greeno et al. (2017) and Pecukonis et al. (2016) to be rated as low. Schinke et al. (1978) and Wells (1976) stated that students were randomly assigned to groups; however, they did not provide any details about how students were recruited or allocated. Both authors have passed away, so further information could not be ascertained.
Although there were no obvious baseline differences between groups to indicate a problem with the randomisation process, the absence of detailed information led to a judgement of some concern for both studies in this domain.
Domain 2: Risk of bias due to deviations from the intended interventions (effect of assignment to intervention)
Given that placebos and sham interventions are generally not feasible in educational interventions, students and staff tended to be aware of which intervention the students were assigned to, particularly since students were largely drawn from cohorts known to each other.
Control group scores were markedly different from intervention scores, suggesting contamination between groups did not occur. In reviewing the papers, there were no reports of control groups receiving the active intervention, nor did trialists report that they had changed the intervention. However, a lack of information about deviations from the intended interventions is reflected in our use of the term 'not reported'.
Similarly, there was no information as to whether an appropriate analysis had been used to estimate the effect of assignment to intervention. Higgins et al. (2019, p. 26) acknowledge that 'exclusions are often poorly reported, particularly in the pre-CONSORT era before 1996'. Apart from the study reported by Pecukonis et al. (2016) and Greeno et al. (2017), the randomised trials included in this review were conducted in the 1970s, which helps to explain why interpreting the risk of bias for these empirical studies was particularly difficult. For most of the randomised trials, there was nothing to suggest a potential for a substantial impact on the result of any failure to analyse participants in the group to which they were randomised. (Table 3 notes: additional information about Larsen and Hepworth's (1978) study was provided in Larsen's (1975) PhD thesis, allowing the risk of bias in Domain 4 to be downgraded from high to low; Greeno et al. (2017) and Pecukonis et al. (2016) used the same data set, and email contact with Elizabeth Greeno confirmed that randomisation occurred using SPSS.)
However, again a lack of information led the reviewers to replace a bias rating with 'not reported'. Wells' (1976) study provides an exception to this rule. Noting that two students from each group swapped due to placement clashes, Wells did not perceive this as an issue. However, the data of these students were analysed in terms of the interventions they received rather than the interventions to which they were initially assigned. As a result, both review authors deemed the risk of bias rating to be high for this domain.
Domain 3: Risk of bias due to missing outcome data
Some studies (Greeno et al., 2017; Larsen & Hepworth, 1978; Pecukonis et al., 2016; Schinke et al., 1978) retained almost all of their participants, hence no data or very little data were missing, warranting a low risk of bias rating for the missing outcome data domain. Pecukonis et al. (2016), for example, identify low attrition as a strength in their study, highlighting that retention at T3 and T4 was 96% and 94%, respectively (p. 501).
Three studies were judged to be at high risk of bias due to missing data and a lack of any accompanying information. Laughlin (1978) identified that out of 68 students in her study, 'seven subjects failed to complete either the pre- or post-test because of absence from class on the day these tests were administered' (p. 40). Information about the group for which data were missing was not provided. In Wells' (1976) study, the four students who were not present at post-testing were excluded from the analysis, and whilst the number may seem small, they represent a significant proportion of the original study sample, which comprised only 14 students. Hettinga (1978, p. 57) 'assumes that no interaction of selection and mortality occurred', yet researcher assumptions do not constitute evidence. In all three of these studies, the reasons for the absences were unclear and there was no evidence to indicate that the result was not biased by missing outcome data. The authors did not discuss whether missingness depended on, or was likely to depend on, its true value. Yet it is possible, likely even, that missingness in the outcome data could be related to the outcome's true value if, for example, students who perceived their communication skills to be poor decided not to attend the post-test measurements. As a result of this, and the study authors' lack of attention to these issues, we judged there to be a high risk of bias due to missing outcome data in the trials undertaken by Hettinga (1978), Laughlin (1978), and Wells (1976).

Domain 4: Risk of bias in measurement of the outcome
Had we relied solely on Larsen and Hepworth's (1978) article, the risk of bias would have been rated conservatively high because the study does not say if the outcome assessors knew to which group the students belonged. However, in her PhD thesis, on which Larsen and Hepworth's (1978) article is based, Larsen (1975) clearly states that three social work raters were blind to the identification of the student and to their intervention/control group status.
The additional information enabled reviewers to judge this domain as being at low risk of bias.
In studies where two different outcome measures were used, bias ratings were judged separately, indicated by the split outcomes in domain 4 in Table 3. For Greeno et al. (2017), Pecukonis et al. (2016) and Schinke et al. (1978), low bias ratings were given for measures of behaviour change due to evidence of independent raters, blind to the intervention status of participants. However, the self-report measures used by each warranted a higher risk of bias rating.
According to the RoB 2 guidance, for self-reported outcomes, the assessment of outcome is potentially influenced by knowledge of the intervention received, leading to a judgement of at least some concerns (Higgins et al., 2019, p. 51). Schinke et al. (1978) required participants to rate their attitudes towards their own performance. In this study, students were aware of which intervention group they belonged to, yet the waiting list control procedure reduced potential issues such as social desirability, hence a rating of some concerns was considered appropriate. In the study reported by Greeno et al. (2017) and Pecukonis et al. (2016), whose subjective measures included perceived empathy and self-efficacy respectively, it seems probable that students were aware of the intervention group they belonged to. Given there were no differences between groups on either outcome measure, it seems unlikely that participants' reporting of the outcome(s) was influenced by knowledge of the intervention received. The 'some concerns' rating was applied to both. Hettinga (1978) reports that the researcher had no knowledge as to which treatment groups the participants were randomly assigned.
However, the outcome assessors were the students, who were completing two subjective measures: the Rosenberg Self-Esteem Scale and the self-perceived interviewing competence (SPIC) questionnaire. It is likely that the students were aware of which intervention they received. The lack of change for self-esteem meant this outcome measure was given the 'some concerns' rating. However, we took a more cautious approach to students' self-perceived interviewing competence as the results were significant.
Knowledge of the intervention could have had an impact, for example, if those students in the self-instruction group had tried harder. There was no information to determine the likelihood that assessment of the outcome was influenced by knowledge of the intervention received, which led to a conservative judgement from the reviewers of a high risk of bias for this outcome measure.
In the study conducted by Laughlin (1978), the high risk of bias is due to known differences in the measurement of the outcome between the intervention groups.

Domain 5: Bias in selection of the reported result
None of the randomised trials was accompanied by a protocol or a priori analysis plan. Consequently, verifying how reported results were selected was not possible. Due to a lack of information in all the included randomised trials, we could not make a risk of bias judgement for this domain.

Overall risk of bias
Only one included study (Larsen & Hepworth, 1978) received a low risk of bias rating overall; one study (Schinke et al., 1978) was considered to have some concerns; three studies (Hettinga, 1978; Laughlin, 1978; Wells, 1976) were rated as being at high risk of bias; and the study reported by Greeno et al. (2017) and Pecukonis et al. (2016) received a mix of overall bias ratings according to the outcomes measured. Montgomery and Belle Weisman (2021) state that the completeness of reporting of published articles is generally poor, and that information fundamental for assessing the risk of bias is commonly missing. Whilst reporting is seen to be improving over time, the majority of the included trials were conducted in the 1970s and are, evidently, a product of their time. Where study authors have not provided sufficient information, we have indicated that information was not reported. We also acknowledge that we adopted a conservative approach, therefore we might have judged the risk of bias harshly, potentially elevating the risk of bias either at the domain level or in the overall bias judgement for some studies. Frequent discussions supported our endeavours to be consistent.

| Risk of bias in non-randomised studies
As shown in Table 4, there are clear similarities across some domains as well as some marked differences in the risk of bias ratings of the non-randomised studies, which were judged in accordance with ROBINS-I. For the overall bias ratings, the review authors either judged there to be a 'moderate' or 'serious' risk of bias in each study outcome reviewed, or in one instance, a 'no information' rating was issued, because assessing the risk of bias was not feasible.
Domain 1: Bias due to confounding
Sterne, Higgins, et al. (2016, p. 20) suggest 'baseline confounding is likely to be an issue in most or all NRSI', which was reflected in the included studies of this review. The lack of information in two of the studies (Keefe, 1979; Vinton & Harrington, 1994) made it difficult to assess the extent of baseline confounding.

Domain 2: Bias in selection of participants into the study
Selection bias arises when selection into the study is related to both the intervention and the outcome (Sterne, Higgins, et al., 2016, p. 30). There was nothing to suggest that any students were selected based on participant characteristics after the intervention had commenced in any of the studies, therefore a low risk of bias was given to all of the studies for this domain.

Domain 3: Bias in classification of interventions
All of the non-randomised studies used population-level interventions therefore the population is likely to be clearly defined and the collection of the information is likely to have occurred at the time of the intervention (Sterne, Higgins, et al., 2016, p. 33). As a result, the bias ratings for this domain were low in almost all of the studies. We could have issued no information ratings but decided a low rating was probably a better reflection of the non-randomised studies in this domain. One study provides an exception to the rule. Collins (1984, p. 67) stated, 'it was not possible to establish a control group where no laboratory training took place'. This suggests the lecture-trained and lab-trained groups were not as distinctly different as was necessary, hence the serious risk of bias rating was applied for this domain.

Domain 4: Bias due to deviations from intended interventions
None of the studies reported on whether deviation from the intended intervention took place, hence the no information rating was issued for this domain across all of the studies.

Domain 5: Bias due to missing data
For some of the non-randomised studies (Collins, 1984; Keefe, 1979; Ouellette et al., 2006; Toseland & Spielberg, 1982), data sets appeared complete or almost complete. In VanCleave's (2007) study, where attrition was slightly higher, the number of missing participants was similar across the intervention group (N = 3) and control group (N = 2); reasons for drop-out were also provided. A low bias rating was given for the missing data domain in these studies.
In Vinton and Harrington's (1994) study, a complete data set was provided for the QMEE scores, hence a low bias rating judgement was warranted, but the absence of student numbers for the Carkhuff scores meant a bias rating for this outcome measure could not be issued. An absence of information, on which to base a judgement, was also reflected in the results of Barber's (1988) experiments.
In Rawlings' (2008) study, results were reported as if all student data were present, however data were missing for some of the entering students. It is concerning that the results tables do not acknowledge the missing data. An imputational approach such as last observation carried forward or the use of group means would have enabled missing data to be dealt with, but instead the researcher has simply analysed the data available. Given that the missingness is not explained, both reviewers agreed that a serious risk of bias was justified.

Domain 6: Bias in measurements of outcomes
The timing of outcome measurements was problematic in three of the studies. A delay of approximately 3 weeks occurred in Collins' (1984) study for students completing the analogue measures, which reduced the time gap between pre- and post-test training scores. A bias rating of moderate concern was justified given this could have led to an under-estimation of the positive gains made by students on this outcome measure.
In Keefe's (1979) study, although students were tested after their respective interventions, the interventions were of different durations hence the data collection time points varied. These are not comparable assessment methods. The meditation group was also tested three times, thus familiarity with the test may have produced the higher scores on the Affective Sensitivity Scale, rather than demonstrating a genuine improvement. Keefe (1979) states that levels of meditation attainment were blind rated (p. 36), however students in the experiential intervention group self-assessed only, the subjectivity of which increased bias in the measurements of outcomes. These issues elevated the risk of bias in this domain to serious.
VanCleave (2007) reports, 'the Davis self-inventory was completed by the participant before, or following, each 8 excerpt role played situation' (p. 118). Inconsistency surrounding the timing of when the instrument was completed led to a serious bias rating for the outcome measure of empathic concern and perspective taking.
However, a low rating was given for empathic response where timing issues were not a cause for concern and independent raters were not aware of students' intervention group status. The different ratings applied to each outcome is represented by the split ratings for this domain in Table 4.
The same approach of splitting the outcome measures domain was taken in Rawlings' (2008) study. The direct practice outcome was judged to have a low risk of bias rating because assessors were blinded to the intervention status, whereas the self-efficacy outcome received a moderate risk of bias rating, as the students themselves were the outcome assessors. Given the students comprised discrete cohorts, knowledge of the intervention group was not considered problematic by the reviewers. Conversely, the self-assessment measure in Vinton and Harrington's (1994) study warranted a serious risk of bias rating. The potential for study participants to be influenced by knowledge of the intervention they received was considerable. The emotional empathy scores of the control group dropped considerably at post-test, which could be an indication that the students had become aware that their peers were receiving beneficial interventions aimed at developing empathy, which they were not. Discussions between students were more likely in this study given they were all in the same cohort. Contamination effects could have impacted students' self-assessment scores.
Independent outcome assessors and appropriate blinding were used in all of the outcome measures used in Collins' (1984) study and in the video-tape interviews in Ouellette et al.'s (2006) study, which, with the exception of the timing issues associated with Collins' (1984) analogue measure, resulted in low bias ratings for the outcomes measures in these two studies.
Key information was lacking in some studies. Notably in Barber's (1988) experiments, a judgement about the methods of outcome assessment could not be made at all due to the absence of information. Toseland and Spielberg (1982) described their judges as being independent but did not state whether or not they were aware of which intervention the student had received. For the outcome relating to empathic response, Vinton and Harrington (1994) provided no information about blinding or the independence of the outcome assessors. Potentially then, this study is also at risk of researcher allegiance bias. If, for example, the outcome assessors were part of the same institution as the instructors and the students, or of even more concern, if the assessors were the instructors, then this could pose a serious risk of bias, because potentially they have a vested interest in the findings. It was not possible to establish assessor independence, so the reviewers opted for a 'no information' rating for the Carkhuff scales outcome measurement in Vinton and Harrington's (1994) study.
Research suggests that if study authors play a direct role, studies are more likely to be biased in favour of the treatment intervention (Eisner, 2009;Maynard et al., 2017;Montgomery & Belle Weisman, 2021). There is a distinct possibility that researchers of the included studies delivered the interventions themselves, leading to a further source of bias. VanCleave, for example, who had 19 years of teaching experience as an adjunct in the university where her research was conducted, acknowledged that 'the researcher acted as teacher and facilitator in the intervention, which is typically not a recommended research strategy' (VanCleave, 2007, p. 117). The same issue is likely present in at least some of the other nonrandomised studies, although there was a lack of information from which to establish its presence or impact.

Domain 7: Bias in selection of reported results
There was no obvious bias in the reporting of results for any of the reported outcomes in the non-randomised studies, however, there were no protocols or a priori analysis plans with which to compare the reported outcomes with the intended outcomes. Studies were not reported elsewhere hence external consistency could not be established. The 'no information' category was deemed most appropriate by both reviewers.

Overall risk of bias judgement
Only two studies (Ouellette et al., 2006; Toseland & Spielberg, 1982) received an overall bias rating of moderate, reflecting a moderate rating in the confounding domain. Other studies (Barber, 1988; Collins, 1984; Keefe, 1979; Rawlings, 2008) were considered to be at serious risk of bias overall, due to receiving a serious risk of bias rating in at least one domain. For one study (Vinton & Harrington, 1994), the absence of information in several domains led to a 'no information' rating in the overall risk of bias judgement for one outcome measure but a serious risk of bias in another.
Similarly, another study (VanCleave, 2007) also received a split rating for the overall risk of bias domain, with a moderate risk of bias for one outcome measure and a serious risk of bias for the other.

| Effects of interventions
The results, as shown in Table 5, are reported for the data that are available and relevant to answering the research question, using either the mean post-test differences between intervention and control groups or the mean change score between the two groups. The interventions varied enormously, ranging from live supervision with standardised clients (Greeno et al., 2017), to 3 months of role-play and 3 weeks of meditation (Keefe, 1979), to a multitude of components including art and music (VanCleave, 2007), to the use of videotapes of an unspecified amount and time period (Vinton & Harrington, 1994). Meta-analysing such disparate interventions would therefore not be meaningful.

Gagnier et al. (2013) identified twelve recommendations for investigating clinical heterogeneity in systematic reviews. In terms of the review team, one of us (PM) is a methodologist and the other (ERH) has significant relevant clinical expertise. ERH regularly discussed issues relating to population, intervention and measurement characteristics with the stakeholder group, which included educators, students and people with lived experience. This provided a range of different perspectives, encouraging us to be reflective and reflexive in our approach, including recognising our own biases. In relation to planning, the rationale for the selection of clinical variables we hoped to consider was described a priori in the protocol. Other methods require statistical calculations for which we did not have sufficient data. For example, we had hoped to perform a subgroup analysis relating to the intensity of the interventions, but such data were not sufficiently available: absent in four studies and described in non-numerical terms (e.g., as 'extensive' or 'one day') in a further three. Gagnier et al. (2013) acknowledge the challenge posed by the incomplete reporting of data.
Given the extreme clinical heterogeneity, meta-analysis was neither feasible nor meaningful. Instead, the findings are synthesised narratively and are organised according to a refined version of the well-known and widely used classification of educational outcomes developed by Kirkpatrick (1967). This classification was refined by Kraiger et al. (1993) to distinguish between cognitive, affective and skill-based outcomes, and adapted by Barr et al. (2000), followed by Carpenter (2005).

| The importance of empathy
Reported in 9 of the 15 included studies (Collins, 1984; Greeno et al., 2017; Keefe, 1979; Larsen & Hepworth, 1978; Laughlin, 1978; Pecukonis et al., 2016; Toseland & Spielberg, 1982; VanCleave, 2007; Vinton & Harrington, 1994; Wells, 1976), empathy is a common topic of interest within this review. The pivotal role of empathy in social work practice is widely acknowledged (Forrester et al., 2008; Gerdes & Segal, 2009; Lynch et al., 2019), hence the need for students to develop empathic abilities is deemed critical for preparing them for social work practice (Greeno et al., 2017; Zaleski, 2016). As a skill which can be 'taught, increased, refined, and mediated' (Gerdes & Segal, 2011, p. 143), it is hardly surprising that empathy features so frequently within the empirical literature. Truax and Carkhuff (1967, p. 46) describe empathy as 'the ability to perceive accurately and sensitively the feelings, aspirations, values, beliefs and perceptions of the client, and to communicate fully this understanding to the client'. As study authors Vinton and Harrington (1994, p. 71) point out, 'these are separate but related phenomenon'.
Empathy is a multifaceted phenomenon. Carpenter (2005, 2011) suggests that Level 2a outcomes relate to changes in attitudes or perceptions towards service users and carers/care-givers, their problems and needs, circumstances, care and treatment. Motivational outcomes and self-efficacy also comprise this level (Kraiger et al., 1993).

Attitudes and perceptions towards clients
Students' perceptions towards clients were an outcome of interest for a number of studies included in this review. Affective sensitivity (Keefe, 1979), emotional empathy (Vinton & Harrington, 1994), empathic concern and perspective taking (VanCleave, 2007) and perceived empathy (Greeno et al., 2017) all featured as outcomes. Empathic understanding has been further defined as both an affective process and a cognitive process. These different ways of conceptualising empathy are evident within the included studies, and in the choice of measuring instruments the researchers employed.

Affective and cognitive outcomes
To ascertain students' abilities to detect and describe the immediate affective state of clients, Keefe (1979) employed Kagan's scale of affective sensitivity (Campbell et al., 1971), which consists of multiple-choice items used with a series of short, videotaped excerpts from actual counselling sessions. In Keefe's study, a positive and significant effect size of 0.32 was found only once the intervention group had been taught meditation in addition to the experiential training they received, correlating with blind-ranked levels of meditation attainment. Keefe (1979) reported that the combined effects of both conditions produced mean empathy levels beyond those attained by master's and doctoral students. Segal et al. (2017, p. 98) suggest that using meditation can promote emotional regulation, which can be considered fundamental to empathy. Dupper (2017, p. 31) suggests that mindfulness is an effective strategy for 'reducing implicit bias and fostering empathy towards members of stigmatised outgroups'. Both propositions could explain why the combined interventions in Keefe's (1979) study proved most effective.
Also viewing empathy as an affective state, Vinton and Harrington (1994) sought to assess students' 'emotional empathy', which they describe as 'the ability to be affected by the client's emotional state' (p. 71). They employed a different outcome measure, the Questionnaire Measure of Emotional Empathy (QMEE) (Mehrabian & Epstein, 1972), which emphasises the affective component of empathy, including emotional arousal to others' distress. Two intervention groups received an instruction package utilising videotapes: one relied on self-instruction, the other also received input from an instructor and peer group, whilst the control group received no intervention. At post-test, we found a small effect size of 0.21 between the 'video other and self' group and the controls; however, the QMEE scores of both groups had actually declined. Despite these results, Vinton and Harrington (1994) suggested that further investigation into the use of videotape or film is warranted.
Building on the suggestion by Vinton and Harrington (1994) that film can assist the development of empathic understanding, the students in VanCleave's (2007) study watched a 2-h commercial film, with 30 min of reflection and discussion. The self-report measure used comprised two subscales from the Interpersonal Reactivity Index (IRI) (Davis, 1980): the first, empathic concern, addresses the affective component of empathy; the second, perspective taking, focusses on the cognitive component. Despite using a broader conceptualisation of empathy and a more inclusive measure, which produced an effect size of 0.22, changes were not statistically significant.
Utilising a different instrument still, Greeno et al. (2017) sought to measure students' perceived empathy using the Toronto Empathy Questionnaire (TEQ) (Spreng et al., 2009), which views empathy as an emotional process, but is based on items from the QMEE and the IRI.
The effect size at post-test was −0.26, with study authors reporting no statistically significant difference between groups. Given a behavioural measure of empathy used by Greeno et al. (2017) demonstrated a statistically significant small effect size for the intervention group, 'the lack of change across time and groups' on the self-reported TEQ scores was 'unexpected' (p. 803).
No statistically significant changes in students' empathic understanding were identified in the studies above, irrespective of the type of self-report measure used. The challenges of measuring empathy through self-reports (Lietz et al., 2011) are clearly evident in this review and will be discussed further in Section 6. Interviewing increased (by an average of 7 points) for both groups over time.

Self-esteem and self-efficacy
Self-esteem, which reflects how people perceive themselves and includes a sense of goodness or worthiness, was an outcome measure in just one of the included studies. Hettinga (1978) argued that self-esteem, as a critical dimension of professional self-dependence, directly relates to the attainment of skills. However, the instrument he used, the Rosenberg Self-Esteem Scale (RSE) (Rosenberg, 1965), measures global self-esteem.
For students in the intervention group, who experienced videotaped interview playback with instructional feedback, the self-esteem score dropped very slightly. For the control condition, who received feedback delivered in a small group format, the self-esteem score remained unchanged. Although we found a small effect size for Section 1, Hettinga suggested the findings were not significant, indicating the intervention had no impact on students' self-esteem scores.

Parker (2006) differentiates between the global nature of self-esteem and the context-specific nature of self-efficacy. Perceived self-efficacy beliefs 'influence whether people think pessimistically or optimistically and in ways that are self-enhancing or self-hindering' (Bandura, 2001, p. 10), which has implications for students' skill development. Self-efficacy is 'an individual's assessment of his or her confidence in their ability to execute specific skills in a particular set of circumstances and thereby achieve a successful outcome' (Bandura, 1986, as quoted in Holden et al., 2002). Literature in the counselling field indicates that self-efficacy may predict performance (Larson & Daniels, 1998), and can thus serve as a proxy measure. The idea that self-efficacy is a means to assess outcomes in social work education has gained traction in recent years (Holden et al., 2002, 2017; Quinney & Parker, 2010).

Two of the included studies measured self-efficacy. Pecukonis et al. (2016) found no change in students' self-efficacy scores, either between the brief motivational interviewing intervention group and the TAU control group, or over time. Rawlings (2008), who evaluated the impact of an entire university degree, found students exiting Bachelor of Social Work (BSW) education had significantly higher self-efficacy scores (mean score of 6.78) than those entering it (mean score of 4.40). Through multiple regression analysis, results showed that BSW education positively predicted self-efficacy.
However, students' self-efficacy ratings did not correlate with their practice skill ratings. Surprisingly, after controlling for BSW education, self-efficacy was found to be a negative predictor of direct practice skill. Rawlings (2008, p. xi) explains that 'self-efficacy acted as a suppressor variable in mediating the relationship between education and skill'. This unexpected finding reflects the controversy surrounding the use of self-efficacy as an outcome measure, which will be revisited in Section 6.3.

Schinke et al. (1978) asked students to rate their attitudes towards their own role-played interviewing performance. A large effect size of 0.93 indicates that CST positively affected the attitudes students had about their performance.
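Rawlings' (2008) suppressor finding is counter-intuitive, so a toy numerical illustration may help. The data below are wholly synthetic (this is not an analysis of the study's data): if an outcome depends only on the 'valid' component of a predictor, a second variable that tracks the predictor's irrelevant component will correlate positively with the predictor yet enter the regression with a negative coefficient, 'suppressing' error variance and sharpening the first coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
true_part = rng.normal(size=n)   # the skill-relevant component of 'education'
noise = rng.normal(size=n)       # a component unrelated to skill

education = true_part + noise    # observed predictor (e.g., a BSW education measure)
self_efficacy = noise            # correlates with education, but not with skill
skill = true_part                # outcome driven only by the valid component

# Regress skill on education alone, then with self-efficacy added
X1 = np.column_stack([np.ones(n), education])
X2 = np.column_stack([np.ones(n), education, self_efficacy])
b1, *_ = np.linalg.lstsq(X1, skill, rcond=None)
b2, *_ = np.linalg.lstsq(X2, skill, rcond=None)

# Alone, education's coefficient is attenuated (roughly 0.5 in this set-up);
# with the suppressor included it rises to 1.0 while self-efficacy's
# coefficient becomes -1.0, even though self_efficacy is positively
# correlated with education.
print(b1[1], b2[1], b2[2])
```

This mirrors the pattern Rawlings describes, in which self-efficacy 'acted as a suppressor variable in mediating the relationship between education and skill': a negative partial coefficient need not mean the construct is harmful, only that it absorbs variance in the predictor that is irrelevant to the outcome.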

Knowledge
The acquisition of knowledge relates to the concepts, procedures and principles of working with service users and carers. Carpenter (2005), after Kraiger et al. (1993), separated knowledge outcomes into declarative knowledge, procedural knowledge and strategic knowledge. Only procedural knowledge, 'that used in the performance of a task' (Carpenter, 2011, p. 126), featured as an outcome in this review, reported in three studies (two publications).
Procedural knowledge. Barber (1988, p. 4) reported findings that ran counter to those of the other studies. However, studies which ensured that students were matched on factors such as demographic variables and pre-course experience (e.g., Toseland & Spielberg, 1982) produced more positive results. Barber's paper is thus an exception to the rule, and his findings should be interpreted cautiously, with due consideration of the measurement and design issues evident within both experiments and the serious risk of bias due to confounding.
In Toseland and Spielberg's (1982) study, two of the four measures employed also tap into the procedural knowledge outcome because students judged the ability of others to respond in a helpful way. First, a film of client vignettes was shown to students, who had to select from five different responses, rating them from 'destructive' to 'most helpful' using the second part of a Counselling Skills Evaluation. Second, through Carkhuff's Discrimination Index (Carkhuff, 1969a), students rated the helpfulness of four counsellor responses to a set of client statements. Difference scores were generated by comparing students' ratings with those produced by trained judges. Discrimination scores indicated that students who had received the training were better able to discriminate between effective and ineffective responses to clients' problems, and their ratings closely matched those of trained judges. With effect sizes of −1.31 for the Carkhuff Discrimination Index and −0.53 for the Counselling Skills Evaluation part 2, significant at the 0.001 level, the findings were robust.

Skills
Skills have been organised hierarchically within the literature on social work education outcomes to include initial skill acquisition, skill compilation and skill automaticity (Carpenter, 2005, 2011; Kraiger et al., 1993). Skill automaticity did not feature as an outcome in this review, which possibly reflects the point made by Carpenter (2005) that 'the measurement of the highest level of skill development, automaticity, poses significant problems' (p. 14). To our knowledge, no valid measure of automaticity for communication skills currently exists.
Initial skills. Initial skills, which are often practised individually, in response to short statements or vignettes, were the most popular outcome reported in this review. 'Trainee behaviour at the initial skill acquisition stage of development may be characterised as rudimentary in nature' (Kraiger et al., 1993, p. 316).
The initial skills considered fundamental for demonstrating empathy were evidently of interest to the researchers of the included studies. Variations of the Carkhuff scales (Carkhuff, 1969a, 1969b), which are widely used in social work education (Hepworth et al., 2010), were employed in seven of the included studies (Collins, 1984; Larsen & Hepworth, 1978; Laughlin, 1978; Toseland & Spielberg, 1982; VanCleave, 2007; Vinton & Harrington, 1994; Wells, 1976). The Carkhuff scales comprise two subsets: empathy discrimination (being able to accurately identify the level of an empathy response) and empathy communication (putting that discriminated empathy into a congruent action response) (Carkhuff, 1969a, 1969b). The Carkhuff scales can require either a written or verbal response to a written statement or audio/video vignette, although instruction was originally mediated through audio recordings (Toukmanian & Rennie, 1975). Independent raters evaluate the level of empathy shown, selecting from five levels, whereby level one represents low levels of empathy and level five indicates high levels. Level three is considered to be a minimally facilitative empathic response.
Using a slightly adapted version of the written statements format of the Carkhuff (1969b) scale, Larsen and Hepworth (1978) assessed students' skill levels in providing empathic responses to 'written messages', with results they report as highly significant (p < 0.001). We calculated a large effect size (1.51), demonstrating, as predicted, that the experimental groups surpassed the control groups on achieved levels of performance. Toseland and Spielberg (1982) sought to replicate and expand on Larsen and Hepworth's (1978) study by developing and evaluating a training programme comprising core helping skills, including genuineness, warmth and empathy. Two of the measures they used capture the initial skills outcome: Carkhuff's Communication Index, in which students were asked to act as though they were responding directly to the client, and the Counselling Skills Evaluation. Students in receipt of the training increased their ability to communicate effectively using the ten helping skills. Nerdrum and Lundquist (1995) suggest that because Larsen and Hepworth (1978) and Toseland and Spielberg (1982) reported ratings for the total communication index rather than empathy specifically, lower empathy scores may have been concealed. Certainly, the instructors in the study reported by Nerdrum and colleagues, which narrowly missed the inclusion criteria for this review, found that empathy was the most difficult of the facilitative conditions for students to grasp. In addition, methods of training and methods of measurement have been confounded in earlier studies, potentially leading to over-inflated treatment effects.
To evaluate an interviewing skills course, Laughlin (1978), also using the Carkhuff instrument, sought to test self-instructional methods, in which one experimental condition relied on self-reinforcement whilst the other received external reinforcement and feedback from an instructor. Both experimental groups produced greater learning gains after training than either of the two control groups. Interestingly, there was no significant difference between the gain scores of the two experimental groups. Laughlin (1978, p. 65) suggests that 'self-managed behavior change can, under certain circumstances, prove to be as efficacious as externally controlled systems of behavior change'. However, students in the self-reinforcement group rated their own empathic responses, whereas the supervisor rated the responses of students receiving the other experimental condition. As Laughlin (1978) acknowledged, 'the self-instruction group may be considered a product of inaccuracy in the self-evaluation process' (p. 68). Other studies have identified that students often over- or underestimate their abilities (Kruger & Dunning, 1999). Based on their mean gain scores, we calculated a large effect size of 1.22 between the experimental condition who received external reinforcement and feedback and the control group who received no instruction.

Vinton and Harrington (1994) also appear interested in the role of the self in student learning, and they too used the Carkhuff scales to investigate this issue. At post-test, a large effect size (0.88) was observed between the 'videotape self and other' group and the controls. At one-month follow-up, Vinton and Harrington (1994) found the majority of students in the intervention groups reached the level Carkhuff deemed to be facilitative.
To compare the effects of role-play and the use of participants' own problems for developing empathic communication skills through facilitative training, Wells (1976) used a variant of Carkhuff's (1969a) communication test, in which students were asked to respond empathically in writing to four tape-recorded helpee statements before training and to a different set of four statements after training.
Contrary to Wells' assertion that no differential effect between the role-play and 'own problems' procedures was identified, and the suggestion that active experimentation by students in both groups explains their modest outcome gains, we found a large effect size of 0.84 at post-test. This finding should be interpreted cautiously given it is based on just five students per group.

Collins (1984) used two written skills measures: the Carkhuff stems, using written client statements as stimuli, and a Skills Acquisition Measure (SAM), which uses an audio-video client stimulus. Both measures seek to capture outcomes that can be categorised as initial skills. The mean scores on the Carkhuff stems at post-test were slightly higher for lab-trained students than lecture-trained students. Effect sizes were 0.60, 0.78 and 1.13 for empathy, warmth and genuineness respectively. However, Collins (1984) reports that statistical significance was only reached for empathy, which he suggests might be because both lecture and lab training prepare students adequately for the relatively straightforward task of producing written statements in response to short client vignettes. Warmth and genuineness might be easier to demonstrate than empathy, hence lecture-based students could manage them satisfactorily.
Similar, but slightly higher, findings were demonstrated through the Skills Acquisition Measure (SAM), wherein students were asked to respond in writing to a series of vignettes. They were advised that their responses should be based on what they would say if they were conducting the interview. Student responses to the SAM were scored by trained raters using the Carkhuff scales. The post-test scores of lab-trained students compared favourably with those of the lecture-trained students. Large effect sizes of 1.21, 1.37 and 1.77 were found for empathy, warmth and genuineness respectively. Collins (1984) concluded that findings from the Carkhuff stems and the Skills Acquisition Measure provide evidence that lab-based training is more effective than lecture-based training for teaching interpersonal interviewing skills to social work students. Carkhuff (1969a) suggested that responses to the stimulus expressions in written and verbal form are similar to responses offered in an actual interview with a client. However, it should be noted that this alleged equivalency of measures has been questioned throughout the literature.

VanCleave (2007) noted that making an advanced verbal empathic response is arguably more challenging than producing written statements. In her study, expert raters used the Carkhuff Index for Communication scripts (CIC) to evaluate the videotaped responses of students to actors who verbally delivered excerpts based on the Carkhuff stems. Tapes contained vignette responses, rather than role-played sessions in their entirety. With a large effect size of 1.79, students in the intervention group demonstrated more empathy than the students who did not receive the empathy response training.
In summary, multiple studies demonstrated an increase in social work students' communication skills, including empathy, following training. The results for actual skill demonstration are modest yet promising.
Compilation. The compilation of skills is the term coined by Kraiger et al. (1993) to refer 'to the grouping of skills into fluid behaviour' (Carpenter, 2005, p. 12). Methods for measuring the compilation of skills include students' self-ratings of competencies and observer ratings of students' communication skills in simulated interviews (Carpenter, 2011). Wilt (2012) argued that simulation fosters more in-depth learning than discussions, case studies and role-plays, because it places the student in the role of the worker and requires real-time decision-making that includes ethical considerations.
In the study by Collins (1984), analogue interviews, which consisted of a 10-min role-play with one student in the worker role and another in the client role, showed modest gains, whereby 23% of students in the lab group improved by 0.5, to a level which Carkhuff and Berenson (1976) suggested was the sign of an effective intervention. This was significantly lower than the 52% who showed a 0.5 improvement on the Skills Acquisition Measure. However, Collins (1984) suggests that direct comparison of the findings is problematic given the delay (of approximately 3 weeks) in students completing the analogue measures, which reduced the time gap between pre- and post-training scores. Despite this, the improvements shown in the analogue interviews were still significant. When comparing the two interventions (lab versus lecture), the lab-trained students demonstrated more skill than the lecture-trained group, as demonstrated by very large effect sizes of 1.74 for empathy, 1.80 for warmth and 1.88 for genuineness.

Hettinga (1978) sought to measure the impact of videotaped interview playback with instructional feedback on student social workers' interviewing skills. A tailor-made instrument was used to measure self-perceived interviewing competence (SPIC). At post-test, the mean score for the combined intervention groups was 62.60, whereas for the control groups the mean score was 57.47. This finding was supported by moderate to large effect sizes of 1.10 for Section 1 and 0.64 for Section 2, albeit with small sample sizes. The significantly higher scores for the intervention group suggest that students' self-perceived interviewing competence was positively impacted by videotaped interview playback with instructional feedback. Hettinga (1978) acknowledged the problem of using self-reports as a measure of skill accomplishment. This is considered further in Section 6.3.
Both methods (self-ratings and observer ratings) were used in the study conducted by Schinke et al. (1978).

5.4.5 | Level 3: Behaviour and the implementation of learning into practice

Collins (1984) was the only study in this review to include a behavioural outcome. Scores from client interviews, which consisted of tape-recorded interviews with clients at the start of their field practicums, were compared to scores from the analogue role-play interviews at the end of the training to investigate the transfer of skills into practice. There was a drop for lab-trained students from their analogue role-play scores to their client interviews: from 2.72 to 2.22 (T = 7.59) for empathy, 2.79 to 2.35 (T = 6.82) for warmth and 2.63 to 2.28 (T = 6.65) for genuineness. These findings suggest students did not transfer their learning from the laboratory into practice, which Collins (1984) attributes to measurement anxiety, problems with the measures and the fundamental differences between lab and fieldwork settings.

The anticipated outcome of a positive change in the modification of perceptions and attitudes of students (including cognitive and affective changes) following training was not borne out in the data. This may in part be a result of how these outcomes are conceptualised and measured, with self-reports being particularly problematic. Of the 15 included studies in this review, two studies, reported in one paper (Barber, 1988) (N = 82), identified a negative outcome for the acquisition of knowledge, whereby trained students placed less value on responsive and unresponsive interviewing behaviour and were less accurate in their ability to predict clients' reactions than their untrained counterparts.
However, there was no convincing evidence to suggest that the teaching and learning of communication skills in social work education causes adverse or harmful effects.
The review identified considerable gaps within the evidence; further research is required. This is discussed in Section 7.
6.1.1 | Level 1: Learner reactions

The evidence was inconclusive as only two studies (N = 108) contributed data. However, the findings, whilst limited, reflect a criticism of the growing trend, in the UK at least, to rely on quality assurance templates, which collect end-of-course satisfaction ratings only and fail to measure outcomes (Carpenter, 2011).

6.1.2 | Level 2a: Modification in attitudes and perceptions

One study (N = 23), Schinke et al. (1978), found that students' positive attitudes towards their skills were almost three times higher among students who had received CST than those who had not. Whilst promising, the evidence was inconclusive because too few studies contributed data. The review also highlights the

| Level 3: Changes in behaviour
The evidence was inconclusive because only one study (N = 67) reported this outcome.
6.1.6 | Level 4: Changes in organisational practice and benefits to users and carers

This outcome was not addressed in any of the studies included in this review.
6.1.7 | Adverse effects

The evidence was inconclusive as only one paper (N = 82) contributed data.

| Overall completeness and applicability of evidence
The included studies indicate, albeit tentatively, that interventions for teaching communication skills in social work education seem to have a positive impact, at least on demonstrable skills outcomes, and in the short term. Only Barber (1988) reported a negative outcome. In terms of publication bias, we recognise that there will be some PhD theses and trials containing negative results which we have not located in this review, and we acknowledge that publication bias could potentially be an issue. We took steps to minimise the risks, including a wide-reaching and extensive search (excluding outcomes) and contacting subject experts to identify any publications we might have missed through our search strategy. Strategies typically used to assess publication bias, such as funnel plots, were not feasible due to the small size and number of the included studies, and their lack of power.
Extreme levels of heterogeneity and moderate to high/serious risk of bias ratings in the included studies meant that meta-analysis was not feasible; consequently, a narrative review was undertaken. Outcomes were analysed and structured according to the outcomes framework for social work education developed by Carpenter (2005), after Kirkpatrick (1967), Kraiger et al. (1993) and Barr et al. (2000). Although data exist for some outcomes at levels 1-3, none of the included studies addressed outcomes at level 4a (changes in organisational practice) or level 4b (benefits to users and carers); therefore, significant gaps in the evidence base remain.

| Quality of the evidence
Whilst there was overall consistency in the direction of mean change for the development of communication skills of social work students following training, we must acknowledge that the body of evidence is small in terms of eligible studies and that rigour across this body of evidence is low. The assessment of methodological quality and the risk of bias, examined using the ROB 2 tool for randomised trials and the ROBINS-I tool for non-randomised studies, was judged to be moderate to high/serious, or incomplete, in all but one of the included studies. Confounders such as differences at baseline, missing data and the failure to address missingness appropriately, and outcome assessors' knowledge of the intervention and its recipients were the most significant detractors from the internal validity of the studies reviewed.
Empathy has featured in skills training for more than 50 years; however, as the studies in this review indicate, 'evidence of empathy training in the social work curriculum, remains scarce and sketchy' (Gerdes & Segal, 2011, p. 142). As Gair (2011, p. 791) maintains, 'comprehensive discussion about how to specifically cultivate, teach and learn empathy is not common in the social work literature', and the evidence that does exist is fairly limited. The same criticisms have been levelled against research into the teaching and learning of communication skills in social work education more generally (Dinham, 2006; Trevithick et al., 2004). Given the range and extent of bias identified within this body of evidence, caution should be exercised in judging the efficacy of the interventions for improving the communicative abilities of social work students.

| Concerns about definitions and conceptualisations
One of the challenges evident in this review is the considerable variation in the way the study authors define key constructs, particularly in relation to empathy. Defining empathy remains problematic (Batt-Rawden et al., 2013) because the construct of empathy lacks clarity and consensus (Gerdes et al., 2010). Approaches to measuring empathy have been categorised as self-rating (first person assessment), service user/patient-rating (second person assessment) and observer rating (third person assessment) (Hemmerdinger et al., 2007). Ratings from service users were absent from the included studies, possibly because of geographical factors. Most of the included studies were conducted in North America, where the inclusion of service users and carers in social work education is less prominent than in the UK, for example. Many of the included studies used validated scales whereas others developed their own measures. However, even with validated scales, measurement problems were encountered by the study authors.

Self-rating
Much of the outcome data in social work education has relied on self-report, a trend reflected in this review. Self-reports appeared appropriate for measuring satisfaction with teaching and practice interventions in Laughlin's (1978) and Ouellette et al.'s (2006) studies, although these outcomes did not correlate with students' improvement in skills. Self-efficacy scales are another type of self-report, one which has been adapted for research into the teaching and learning of communication skills of social work students specifically (e.g., Tompsett, Henderson, Gaskell Mew, et al., 2017b). They are inexpensive and easy to administer and analyse. However, the limitations of using self-efficacy as an outcome measure are widely acknowledged (Drisko, 2014). Response-shift bias is one limitation of self-efficacy scales discussed in the literature, whereby some individuals may change their understanding of the concept being measured during the intervention. Such 'contamination' of self-efficacy scores (Howard & Dailey, 1979) can mask the positive effects of an intervention. This may explain why no change was identified by Pecukonis et al. (2016); however, since a retrospective pre-test was not issued to the students in their study, neither the presence nor the impact of response-shift bias can be established. Alternatively, the scales themselves may have contributed to the surprising results found by Rawlings (2008) and Pecukonis et al. (2016), since neither was properly validated. The subjectivity of self-efficacy scales has been identified as another area of concern. Previous research has found that students' self-ratings do not necessarily correlate with those of field instructors/practice educators (Fortune et al., 2005; Vitali, 2011), lecturers or service user-actors. In this review, self-efficacy scores and externally rated direct practice scores did not correlate in Rawlings' (2008) study.
Self-report instruments are still the most common way to measure empathy (Ilgunaite et al., 2017; Segal et al., 2017). However, the challenges associated with measuring perceived empathy through self-reports are well documented (Lietz et al., 2011; Robieux et al., 2018). One such challenge, the 'ceiling and testing effect' (Greeno et al., 2017, p. 803), has been identified elsewhere (Gockel & Burton, 2014) and might result in a lack of significant changes in students' level of reported empathy over time. Ilgunaite et al. (2017, p. 14) also warn of social desirability, highlighting the controversy associated with asking people with poor empathic skills to self-evaluate their own empathic abilities.
Concerns have been raised about what self-reports actually measure, reflecting one type of conceptualisation at the expense of others. For example, the Toronto Empathy Questionnaire used in Greeno et al.'s (2017) study views empathy primarily as an emotional process but leaves the cognitive components of perspective taking and self/other awareness unaccounted for. This reflects wider concerns regarding the validity of self-report questionnaires as an accurate measure of outcomes.
The finding that self-report scores did not significantly correlate with other measures that were used alongside them lends support to the claim that empathic attitudes are not 'a proxy for actions' (Lietz et al., 2011, p. 104). It is possible that skills training has more impact on students' behaviours than their attitudes, a point that was made by Barber (1988). Regardless of the varying explanations, self-report measures of empathy tell us very little about empathic accuracy (Gerdes et al., 2010, p. 2334). The problems are not specific to the studies in this review or social work education in general. In an evaluation of empathy measurement tools used in nursing research, Yu and Kirk (2009) suggested that of the 12 measures they reviewed, none of them were 'psychometrically and conceptually satisfactory' (p. 1790). Schinke et al.'s (1978) study bucked the trend, finding students' positive attitudes towards their skills were almost three times higher among those who had received CST compared to those who did not.
Interestingly, the self-report instrument used in this study measured clearly specified counselling skills, and thus did not suffer from the conceptual confusion faced by those seeking to measure empathy.

Observer ratings
Observer ratings, conducted by independent raters, are often considered to be more valid and reliable measures of communication skills than the aforementioned subjective self-report measures.
Observation measures enable third party assessment of non-verbal and verbal behaviours to be undertaken. As Keefe (1979, p. 31) suggests, 'accurate' empathy when measured against a set of observer rating scales has been the basis for much valuable research and training in social work, particularly when combined with other variables. Observation measures were the primary instrument employed by the researchers of the included studies and produced the clearest demonstration of the effects of CST.
Studies using objective measures showed positive change, suggesting empathy training is effective. Studies using both self-report and objective measures reported no significant changes in empathy using self-report but found higher levels of behavioural empathy when using objective measures. The same pattern was identified in a review of empathy training by Teding van Berkhout and Malouff (2015). As Greeno et al. (2017, p. 804) explain, perceived empathy is not correlated to actual empathic behaviours as scored by observers. Observation measures also posed some challenges for the studies included in this review; for example, the repeated use of scales in training and assessment creates the problem of test-retest artefacts.
The Carkhuff (1969a, 1969b) scales have been frequently used in social work education (Hepworth et al., 2010). However, Collins (1984) found that 'students were significantly better at writing minimally facilitative skill responses than demonstrating them orally as measured in a role-play interview' (p. 124). Noting 'a lack of equivalence between written and oral modes of responding' (Collins, 1984, p. 148), Collins' study challenges the validity of the Carkhuff stems.

Concerns about outcomes
The paucity of evidence-supported outcome measures in social work education has been apparent for some time (Holden et al., 2017), an issue we see reflected in this review.

Self-efficacy
Self-efficacy has been introduced as one means of assessing outcomes in social work education (Bell et al., 2005; Holden et al., 1997, 2002, 2005; Unrau & Grinnell, 2005). Self-efficacy is deemed to be an important component of learning because 'unless people believe they can produce desired effects by their actions, they have little incentive to act' (Bandura, 1986, p. 3).

Potential biases in the review process
We performed a comprehensive search of a wide range of electronic databases and grey literature followed by the hand searching of key journals and reference searching of relevant studies. Both members of the review team screened all records and assessed all included studies against the inclusion criteria set out in the protocol, increasing consistency and rigour and minimising potential biases in the review process.
We sought to locate all publicly available studies on the effect of the teaching and learning of communication skills in social work education, however it is difficult to establish whether our endeavours were successful. It was a surprise to the first author that one of the included studies, which very clearly met the inclusion criteria, was obtained through reference searching rather than through the electronic database search. As predicted by the second author, the age and style of the publication meant no key words were used, a search function upon which the electronic databases rely. Whilst this study came to light through reference searching, we cannot be entirely sure that all other similar studies were surfaced in this way. Therefore, publication bias cannot be entirely ruled out.
Our search was not limited to records written in English; indeed, one of the two unobtainable studies was written in Afrikaans. However, the rest of the studies were written in English.
Rather than indicating a limitation of the way the review was conducted, it is likely that the location of the studies is responsible for the language bias: all of the included studies were conducted in English-speaking countries, with the majority from the United States. Evidence-based practice is well established in the United States, contributing to the use of study designs that increase the likelihood of inclusion in systematic reviews.
Both reviewers independently screened and assessed the studies. Uncertainties and differences of opinion were resolved through contacting study authors for further information and through further reading and discussion, without recourse to a third-party adjudicator.
We are not aware of other potential biases or limitations inherent within the review process.

Agreements and disagreements with other studies or reviews
Findings from the included studies indicate that communication skills including empathy can be learned, and that the systematic training of student social workers produces improvements in their communication skills (Greeno et al., 2017; Larsen & Hepworth, 1978; Laughlin, 1978; Pecukonis et al., 2016; Schinke et al., 1978; VanCleave, 2007), at least in the short term.
The findings of this systematic review broadly agree with the knowledge reviews about communication skills produced for the Social Care Institute for Excellence (Luckock et al., 2006; Trevithick et al., 2004). The knowledge reviews highlight that despite a lack of evidence, weak study designs, and a low level of rigour, study findings for the teaching and learning of communication skills in social work education are promising.
Reviews of communication skills and empathy training in medical education (Aspegren, 1999; Batt-Rawden et al., 2013), where RCTs and validated outcome measures prevail, also suggest that CST leads to demonstrable improvements for students.
The findings from our review identified the same gaps as those found in the UK-based social work knowledge and practice reviews for social work education, suggesting that little has changed. Trevithick et al. (2004) suggest that interventions are under-theorised and that the issue of whether students transfer their skills from the classroom to the workplace is unclear. Our findings concur with these observations, and with those of Diggins (2004). Despite the limitations and variations in educational culture, the findings are still useful, and indicate that CST is likely to be beneficial.
One important implication for practice appears to be that the teaching and learning of communication skills in social work education should provide opportunities for students to practice skills in a simulated (or real) environment. Toseland and Spielberg (1982) suggest that skills diminish gradually if not reinforced. They suggest that students should be exposed to the effective application of interpersonal helping skills in several different courses and be encouraged to practice these skills in a variety of case situations role-played in classroom and laboratory settings, as well as in field settings. Larsen and Hepworth (1978) and Pecukonis et al. (2016) also suggest that CST must be better integrated with practice settings, where students can demonstrate communicative and interviewing abilities with actual clients in real-world practice settings, 'the ultimate test of any social work practice skill' (Schinke et al., 1978, p. 400).
Technology is widely used in the teaching and learning of social work practice skills (Wretman & Macy, 2016). The extent to which this applies to outcomes of communication skills and empathy remains unknown. In this review, the studies that compared face-to-face interventions with online interventions did not reach a consensus: Ouellette et al. (2006) found there was no difference in outcomes between online and face-to-face teaching, whilst Greeno et al. (2017) and Pecukonis et al. (2016) found the outcomes of students who received live supervision were greater than those of students who engaged in self-directed study online. However, we do not know whether student outcomes were affected by the presence or absence of an educator.
Differences might not be attributable to the interventions themselves, for as Levin et al. (2018, p. 777) remark, 'the role of an instructor in online learning cannot be underestimated'.
Certainly, the proliferation of online social work courses is evident across Australia (Australian Association of Social Workers, 2018) and the USA (Council on Social Work Education, 2020). The global coronavirus/Covid-19 pandemic has led to exponential growth of online teaching and learning in social work education, hence 'we can be nearly certain that the "new normal" will include the use of information technology' (Tedam & Tedam, 2020, p. 3). Therefore, it is imperative that we investigate the impact of online learning and web-based instruction, and the role of the educator in different contexts, on the development of social work students' communicative and empathic abilities.

Implications for research
There is much to be done to improve the outcome studies in social work education generally, and for the teaching and learning of communication skills in social work education specifically. Robust study designs that support causal inferences through random allocation to intervention and control groups are a necessity. Steps to reduce threats to the internal validity of case-controlled studies should also be taken, to reduce the impact of the test-retest artefacts identified by Nerdrum and Lundquist (1995) in some of the other studies. More work is needed on defining and measuring outcomes (Diggins, 2004). Validated measures which can be used consistently across future studies would make comparisons easier and enable future synthesis to be more meaningful.
The review found that relying solely on self-report measures was problematic, particularly given that the findings from these did not correlate with the findings produced from other measures. Vinton and Harrington (1994) found there was no statistically significant correlation between students' perceptions of their learning experience and self-assessment of their skill acquisition with the independent evaluator's rating of the students' acquisition of interviewing skills. Methodological triangulation should be considered in future studies.
Other study authors advise researchers to use objective follow-up studies, which would help determine the extent to which training benefits endure after the end of training (Schinke et al., 1978; VanCleave, 2007). Rawlings (2008) advises that a longitudinal design, testing the same students over time, is required. The need to investigate whether or not students were able to transfer their skills into practice has also been firmly stated (Carpenter, 2005).
In addition to outcome studies, VanCleave (2007)

ACKNOWLEDGEMENTS
We are particularly grateful to our stakeholders: the students, practitioners, people with lived experience, social work academics and social work organisations who gave their input into the development of this systematic review. The contribution of two research-minded social work students, Ryan Barber and Fee Steane, is particularly appreciated.
Thank you to the editorial team at Campbell.
Emma Reith-Hall is in receipt of an ESRC studentship.

CONTRIBUTIONS OF AUTHORS
• Content: Emma Reith-Hall

DIFFERENCES BETWEEN PROTOCOL AND REVIEW
None.

SOURCES OF SUPPORT
Internal sources
• Emma Reith-Hall and Paul Montgomery, UK
No internal sources of support.

External sources
• Emma Reith-Hall, UK
ERH is undertaking the systematic review as part of her PhD research, for which she receives ESRC DTP funding (Grant number: ES/P000711/1).

• Paul Montgomery, UK
No sources of support