A technology prototype system for rating therapist empathy from audio recordings in addiction counseling

Scaling up psychotherapy services such as addiction counseling is a critical societal need. One challenge is ensuring the quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech-technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of the psychotherapy interactions. We design a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employ Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimate therapy-session-level empathy codes using utterance-level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert-annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist evaluation.


Introduction
Addiction counseling is a type of psychotherapy in which the therapist aims to support changing the patient's addictive behavior through face-to-face conversational interaction. Mental health care for drug and alcohol abuse is essential to society. A national survey in the United States by the Substance Abuse and Mental Health Services Administration [1] showed that there were 23.9 million illicit drug users in 2012, yet only 2.5 million persons received treatment at a specialty facility [1]. Beyond the gap between the addiction counseling provided and what is needed, it is also challenging to evaluate millions of counseling cases regarding the quality of the therapy and the competence of the therapists.
Unlike pharmaceuticals, whose quality can be assessed during design and manufacturing, psychotherapy is essentially an interaction where multimodal communicative behaviors are the means of treatment; hence the quality is at best unknown until after the interaction takes place. Traditional approaches to evaluating the quality of therapy and therapist performance rely on manual observational coding of the therapist-patient interaction, e.g., reviewing tape recordings and annotating them with performance scores. This kind of coding process often takes more than five times real time, including initial human coder training and reinforcement [2]. The lack of human and time resources prohibits [...] conversations [23, 24]. These works demonstrate the feasibility of computationally modeling empathy through multimodal cues.
However, empathy estimation in the aforementioned work requires manual annotation of behavioral cues not only for training the empathy model, but also for application to new observations. Manual annotation of new observation data prohibits large-scale deployment of therapist assessment, as it costs a large amount of time and manual labor. A fully automatic empathy estimation system would be very useful in real applications, even though manual annotations are still required for training the system. The system should, for example, take the audio recording of the interaction as input and return the therapist empathy rating as its output, with no manual intervention needed in the process. In this work, we propose a prototype system that satisfies this requirement. This paper focuses on the computational aspects; for more discussion from a psychological perspective, please refer to [25]. We build the system by integrating state-of-the-art speech and language processing techniques. The top-level diagram of the system is shown in Fig. 1.
The Voice Activity Detection (VAD) module separates speech from non-speech (when they speak). The diarization module separates speakers in the interaction (who is speaking). The Automatic Speech Recognition (ASR) system decodes spoken words from the audio (what they say). Finally, we employ role-specific language models (i.e., therapist vs. patient) to match the speakers with their roles (who is whom). These four parts comprise an automatic transcription system, which takes the audio recording of a session as input and provides time-segmented spoken language as output. For therapist empathy modeling in this paper, we focus on the spoken language of the therapist only. We propose three methods for empathy level estimation based on language models representing high vs. low empathy: a Maximum Entropy model, a Maximum Likelihood based model trained with human-generated transcripts, and a Maximum Likelihood approach based on direct ASR lattice rescoring.
Given access to a collection of relatively large, well-annotated databases of MI transcripts, we train models for each processing step, and evaluate the performance of the intermediate steps as well as the final empathy estimation accuracies of the different models. In the rest of the paper, Sec. 2 describes the modules and methods of the automatic transcription system. Sec. 3 then describes the lexical modeling of empathy. Sec. 4 introduces the real application data utilized in this work.
Sec. 5 describes the system implementation; Sec. 6 presents the experiments and results; Sec. 7 provides further discussion; and Sec. 8 concludes the paper.

Voice activity detection
Voice Activity Detection (VAD) separates speech from non-speech, e.g., silence and background noise. It is the first module in the system and takes the audio recording of a psychotherapy session as input.
We employ the VAD system developed by Van Segbroeck et al. [26]. The system extracts four types of speech features: (i) spectral shape, (ii) spectro-temporal modulations, (iii) periodicity structure due to the presence of pitch harmonics, and (iv) the long-term spectral variability profile. In the next stage, these features are variance-normalized, and a three-layer neural network is trained on the concatenation of the feature streams.

The neural network outputs a voicing probability for each audio frame, which requires binarization to determine the segmentation points. We use an adaptive threshold on the voicing probability to constrain the maximum length of speech segments: the binarization threshold is increased from 0.5 until all segments are shorter than an upper bound on segment length (e.g., 60 s). Spoken segments longer than that are infrequent in the target dyadic interactions and are not memory efficient to process in speech recognition. We merge neighboring segments on the condition that the gap between them is shorter than a lower bound (e.g., 0.1 s) and the combined segment does not exceed the upper bound on segment length (e.g., 60 s). After the merging we drop segments that are too short (e.g., less than 1 s).
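As an illustration of this post-processing, the following Python sketch (our own simplified version, not the authors' implementation; the frame shift, threshold step, and duration bounds are illustrative assumptions) binarizes frame-level voicing probabilities with an adaptive threshold, merges close segments, and drops very short ones.

import numpy as np

FRAME_SHIFT = 0.01   # assumed 10 ms frame shift
MAX_SEG = 60.0       # upper bound on segment length (s)
MIN_GAP = 0.1        # merge segments separated by less than this (s)
MIN_SEG = 1.0        # drop segments shorter than this (s)

def frames_to_segments(voiced, frame_shift=FRAME_SHIFT):
    # Convert a boolean frame mask into (start, end) segments in seconds.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(voiced) * frame_shift))
    return segments

def vad_postprocess(voicing_prob):
    # Raise the binarization threshold from 0.5 until no segment exceeds MAX_SEG.
    threshold = 0.5
    segments = frames_to_segments(voicing_prob >= threshold)
    while threshold < 1.0 and any(e - s > MAX_SEG for s, e in segments):
        threshold += 0.05
        segments = frames_to_segments(voicing_prob >= threshold)
    # Merge neighbors separated by a small gap, without exceeding MAX_SEG.
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < MIN_GAP and e - merged[-1][0] <= MAX_SEG:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Drop segments that are too short.
    return [(s, e) for s, e in merged if e - s >= MIN_SEG]

# Toy example: random voicing probabilities for a 30 s recording at 10 ms frames.
print(vad_postprocess(np.random.rand(3000)))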

Speaker diarization
Speaker diarization provides a segmentation of the audio with information about "who spoke when". Separating the speakers facilitates speaker adaptation in ASR and the identification of speaker roles (patient and therapist in our application). We assume the number of speakers is known a priori in the application: two speakers in addiction counseling. Therefore, the diarization process mainly includes a segmentation step (dividing speech into speaker-homogeneous segments) and a clustering step (assigning each segment to one of the speakers). We employ the two diarization methods described below; both take the VAD results and Mel-Frequency Cepstral Coefficient (MFCC) features as inputs.
The first method uses Generalized Likelihood Ratio (GLR) based speaker segmentation and agglomerative speaker clustering, as implemented in [27]. The second method adopts GLR speaker segmentation and a Riemannian manifold method for speaker clustering, as implemented in [28]. This method slices each GLR-derived segment into short-time segments (e.g., 1 s), so as to increase the number of samples in the manifold space for more robust clustering (see [28] for more detail).
After obtaining the diarization results, we compute session-level heuristics for outlier detection, e.g., (i) the percentage of speaking time by each speaker and (ii) the longest duration of a single speaker's turn. These statistics can be checked against their expected values; we define an outlier as a value that is more than three standard deviations away from the mean. For example, a 95%/5% split of speaking time between the two clusters may be a result of clustering speech vs. silence due to imperfect VAD. We use these heuristics and a rule-based scheme to integrate the results from the different diarization methods, as described further in Sec. 5.
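A minimal sketch of such heuristic checks is given below; the statistics and the three-standard-deviation rule follow the description above, while the exact set of heuristics used in the system may differ.

import numpy as np

def session_stats(turns):
    # turns: list of (speaker_id, start_sec, end_sec) from diarization.
    speaking, longest = {}, {}
    for spk, s, e in turns:
        speaking[spk] = speaking.get(spk, 0.0) + (e - s)
        longest[spk] = max(longest.get(spk, 0.0), e - s)
    total = sum(speaking.values())
    return {"speaking_ratio_spk0": speaking.get(0, 0.0) / total,
            "longest_turn": max(longest.values())}

def flag_outliers(values, n_std=3.0):
    # Flag values more than n_std standard deviations away from the corpus mean.
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) > n_std * values.std()

# Toy usage: pool a statistic over many sessions, then flag suspicious ones,
# e.g., a 95%/5% speaking-time split that may reflect speech/silence clustering.
stats = [session_stats([(0, 0, 10), (1, 10, 22), (0, 22, 30)]) for _ in range(20)]
stats.append(session_stats([(0, 0, 57), (1, 57, 60)]))
print(flag_outliers([s["speaking_ratio_spk0"] for s in stats]))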

ASR
We decided to train the ASR using speech recordings from in-domain data corpora that were collected in real psychotherapy settings, since these recordings best match the acoustic conditions (possibly noisy and heterogeneous) of the target application. In this work, a large vocabulary continuous speech recognizer (LVCSR) is implemented using the Kaldi library [29].
Feature: The input audio format is 16 kHz single-channel far-field recording. [...] The vocabulary includes words observed in the training corpus, e.g., mm as a filler word and vicodin as an in-domain word. We ignore low-frequency out-of-vocabulary words in the training corpus, including misspellings and made-up words, which in total account for less than 0.03% of all word tokens.

Speaker role matching
The therapist and patient play distinct roles in the psychotherapy interaction; knowing the speaker role is hence useful for modeling therapist empathy. The two roles also differ in their language use. For example, a therapist may use more questions than the patient.
We expect a lower perplexity when the language content of the audio segment matches the LM of the speaker role, and vice versa. The role-matching procedure is as follows (a code sketch follows at the end of this subsection).

0. Input: training transcripts with speaker roles annotated, and two sets of ASR-decoded utterances U_1 and U_2 for diarized speakers S_1 and S_2.

1. Train role-specific LMs for the therapist and the patient on the role-annotated training transcripts.

2. Mix the final LM used in ASR into the role-specific LMs with a small weight (e.g., 0.1), for vocabulary consistency and robustness.
3. Compute ppl_{1,T} and ppl_{1,P} as the perplexities of U_1 under the two role-specific LMs. Similarly obtain ppl_{2,T} and ppl_{2,P} for U_2.
4. Three cases: (i) if condition (1) holds, we match S_1 to the therapist and S_2 to the patient; (ii) if condition (2) holds, we match S_1 to the patient and S_2 to the therapist; (iii) in all other conditions, we take both S_1 and S_2 as the therapist.
5. Outliers: when the diarization module outputs highly biased speaking times for the two speakers, the comparison of perplexities is not meaningful. If the total word count in U_1 is more than 10 times that in U_2, we match S_1 to the therapist, and vice versa.
6. Output: U_1 and U_2 matched to speaker roles.
When there is not a clear role match, e.g., in step 4, case (iii), and in step 5, we have to make assumptions about the speaker roles. Since our target is the therapist, we choose to oversample therapist language so as to capture more of the relevant information, trading off against the noise introduced by patient language.
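The sketch below instantiates the procedure above. It is only illustrative: perplexity stands in for an external LM scorer (e.g., a wrapper around SRILM's ngram -ppl), and the concrete inequalities used for conditions (1) and (2) are our assumption, since the exact conditions are not reproduced here.

def match_roles(u1, u2, lm_therapist, lm_patient, perplexity):
    # u1, u2: concatenated ASR-decoded words of diarized speakers S1 and S2.
    # perplexity(lm, text): hypothetical helper returning the perplexity of text under lm.
    n1, n2 = len(u1.split()), len(u2.split())
    # Step 5: highly biased speaking time makes the perplexity comparison unreliable.
    if n2 == 0 or n1 > 10 * n2:
        return {"therapist": u1, "patient": u2}
    if n1 == 0 or n2 > 10 * n1:
        return {"therapist": u2, "patient": u1}
    # Step 3: perplexities of each speaker under both role-specific LMs.
    ppl1_t, ppl1_p = perplexity(lm_therapist, u1), perplexity(lm_patient, u1)
    ppl2_t, ppl2_p = perplexity(lm_therapist, u2), perplexity(lm_patient, u2)
    # Step 4: assumed form of conditions (1) and (2).
    if ppl1_t < ppl1_p and ppl2_p < ppl2_t:
        return {"therapist": u1, "patient": u2}
    if ppl1_p < ppl1_t and ppl2_t < ppl2_p:
        return {"therapist": u2, "patient": u1}
    # Case (iii): no clear match, so oversample by taking both speakers as therapist.
    return {"therapist": u1 + " " + u2, "patient": ""}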

Therapist empathy models using language cues
We employ manually transcribed therapist language in MI sessions with high vs. low empathy ratings to train separate language models representing high vs. low empathy, and test the models on clean or ASR-decoded noisy text. One approach is based on Maximum Likelihood N-gram Language Models (LMs) of high vs. low empathy, previously employed in [20] and in [13] for a similar problem; we adopt this method for its simplicity and effectiveness.

Additional modeling approaches may be complementary and increase the accuracy of empathy prediction; for this reason we adopt a widely applied method, the Maximum Entropy model, which has shown good performance in a variety of natural language processing tasks. Moreover, in order to improve test performance on ASR-decoded text, it is possible to evaluate an ensemble of noisy text hypotheses by rescoring the decoding lattice with the high vs. low empathy LMs. In this way, empathy-relevant words in the decoding hypotheses gain more weight and become stronger features. Without rescoring, it is likely that these words do not contribute to the modeling, due to their absence in the best paths of the lattices. In this work, we employ the above three approaches and their fusion in the system.
For each session, we first infer therapist empathy at the utterance level, then integrate the local evidence toward a session-level empathy estimate. We discuss the modeling strategies further in Sec. 7.1. The details of the proposed methods are described as follows.

Maximum Entropy model
The Maximum Entropy (MaxEnt) model is an exponential model that is widely used in natural language processing tasks and achieves good performance in these tasks [33, 34]. We train a two-class (high vs. low empathy) MaxEnt model on utterance-level data using the MaxEnt toolkit in [35].

Let the high and low empathy classes be denoted H and L respectively, and let Y ∈ {H, L} be the class label. Let u ∈ U be an utterance in the set of therapist utterances. We use n-grams (n = 1, 2, 3) as features for the feature functions, where j indexes the n-gram types. We define f_j^n(u, Y) as the count of the j-th n-gram type appearing in u if Y_u = Y, and 0 otherwise. The MaxEnt model then takes the form

$$P_n(Y \mid u) = \frac{1}{Z(u)} \exp\Big(\sum_j \lambda_j^n f_j^n(u, Y)\Big), \qquad (3)$$

where we denote the weight and the partition function as λ_j^n and Z(u), respectively. In the training phase, λ_j^n is determined through the L-BFGS algorithm [36].
Based on the trained MaxEnt model, averaging the utterance-level evidence P_n(H|u) gives the session-level empathy score α_n:

$$\alpha_n = \frac{1}{K} \sum_{u \in U_T} P_n(H \mid u), \qquad (4)$$

where U_T is the set of K therapist utterances.
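For illustration, the toolkit in [35] can be approximated with a logistic regression over 1-3 gram counts, which is equivalent to a binary MaxEnt model; the training utterances and labels below are toy placeholders, and the session score follows the averaging in (4).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy utterance-level training data: label 1 = utterance from a high empathy session.
train_utts = ["it sounds like you feel overwhelmed", "you should just stop drinking"]
train_labels = [1, 0]

vectorizer = CountVectorizer(ngram_range=(1, 3))       # 1-3 gram count features
maxent = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(train_utts), train_labels)

def session_score_alpha(therapist_utts):
    # Average utterance-level posteriors P(H | u) over the session, as in (4).
    posteriors = maxent.predict_proba(vectorizer.transform(therapist_utts))[:, 1]
    return float(np.mean(posteriors))

print(session_score_alpha(["that must have been really hard for you"]))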

Maximum likelihood based model
Maximum Likelihood language models (LMs) based on N-grams can provide the likelihood of an utterance conditioned on a specific style of language, e.g., P(u|H) as the likelihood of utterance u under the empathic style. Following Bayes' rule, the posterior probability P(H|u) is formulated from the likelihoods as

$$P(H \mid u) = \frac{P(u \mid H)}{P(u \mid H) + P(u \mid L)}, \qquad (5)$$

where we assume equal prior probabilities P(H) = P(L).
We train the high empathy LM (LM_H) and the low empathy LM (LM_L) using manually transcribed therapist language in high empathic and low empathic sessions, respectively. We employ trigram LMs with Kneser-Ney smoothing, implemented with SRILM [32]. Next, for robustness, we mix a large in-domain LM into LM_H and LM_L [...] (6).
We compute the session-level empathy score β_n as the average of the utterance-level evidence:

$$\beta_n = \frac{1}{K} \sum_{u \in U_T} P(H \mid u), \qquad (7)$$

where U_T is the same as in (4).
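A minimal sketch of this scoring is shown below; the per-utterance log-likelihoods are assumed to come from the two trigram LMs (e.g., via SRILM), and equal priors are assumed as in (5).

import math

def posterior_high(loglik_high, loglik_low):
    # P(H | u) from the log-likelihoods of u under LM_H and LM_L (equal priors).
    m = max(loglik_high, loglik_low)          # log-sum-exp for numerical stability
    num = math.exp(loglik_high - m)
    return num / (num + math.exp(loglik_low - m))

def session_score_beta(loglik_pairs):
    # Average of utterance-level posteriors over the session, as in (7).
    return sum(posterior_high(h, l) for h, l in loglik_pairs) / len(loglik_pairs)

# Toy example: two therapist utterances with (log P(u|H), log P(u|L)).
print(session_score_beta([(-42.1, -45.3), (-30.2, -29.8)]))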

Maximum likelihood rescoring on ASR decoded lattices
Instead of evaluating a single utterance as the best path in ASR decoding, we can evaluate multiple paths at once by rescoring the ASR lattice. The score (in the likelihood sense) rises for the path of a highly empathic utterance when evaluated on the high empathy LM, while it drops on the low empathy LM. We hypothesize that rescoring the lattice re-ranks the paths so that empathy-related words [...]. We compute the session-level empathy score γ as in (9), where U_T is the set of K lattices of therapist utterances. [...] paths explicitly. It also allows more efficient averaging of the evidence from the top hypotheses.
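As a rough illustration of the rescoring idea (without the details of (8) and (9), which are not reproduced here), the sketch below rescores an N-best list as a stand-in for the full lattice: each hypothesis receives a high-minus-low empathy LM score, the list is re-ranked, and the evidence of the top R paths is averaged. loglik_high and loglik_low are hypothetical LM scoring functions.

def rescore_nbest(nbest, loglik_high, loglik_low, r=100):
    # nbest: list of (asr_score, hypothesis) pairs for one therapist utterance.
    rescored = []
    for asr_score, hyp in nbest[:r]:
        delta = loglik_high(hyp) - loglik_low(hyp)     # empathy LM evidence
        rescored.append((asr_score + delta, delta))
    if not rescored:
        return 0.0
    rescored.sort(key=lambda x: x[0], reverse=True)    # re-rank by rescored total
    deltas = [d for _, d in rescored]
    return sum(deltas) / len(deltas)                   # averaged utterance evidence

# A session-level score could then average rescore_nbest over all therapist
# utterances, analogously to (4) and (7).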

Data corpora
This section introduces the three data corpora used in the study.

The recording format and transcription scheme are the same as for the TOPICS corpus. Each session is about 20 minutes long.
All research procedures for this study were reviewed and approved by an Institutional Review Board.

Empathy annotation in CTT corpus
Three coders reviewed the 826 audio recordings of the entire CTT corpus and annotated therapist empathy using a specially designed coding system [...]. We use the mean value of the empathy codes if a session was coded twice.
In the original study, three psychology researchers acted as Standardized Patients (SPs), whose behaviors were regulated for therapist training and evaluation [...]. There are 133 unique therapists, and no therapist has more than three sessions.

System implementation
In this section, we describe the system implementation in more detail. Table 3 summarizes the usage of the data corpora in the various modeling and application steps. [...] (ii) using the automatically derived diarization results to segment the audio.
Role matching: We use the TOPICS corpus to train role-specific LMs for the therapist and patient. We also mix the final LM in ASR with the role-specific LMs for robustness.
Empathy modeling: We conduct empathy analysis on the CTT corpus.

Due to data sparsity, we carry out a leave-one-therapist-out cross-validation on the CTT corpus, i.e., we use the data from all but one therapist's sessions in the corpus to train the high vs. low empathy models, and test on the held-out therapist. For the lattice LM rescoring method in Sec. 3.3, we employ the top 100 paths (R = 100).
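The cross-validation split can be expressed with scikit-learn's LeaveOneGroupOut, using the therapist identity as the group label; the sketch below is generic, with train_fn and score_fn standing in for any of the three empathy models.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_therapist_out(sessions, codes, therapist_ids, train_fn, score_fn):
    # sessions: per-session therapist language; codes: expert empathy codes;
    # therapist_ids: one therapist identifier per session (the CV groups).
    predictions = np.zeros(len(sessions))
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(sessions, codes, groups=therapist_ids):
        model = train_fn([sessions[i] for i in train_idx],
                         [codes[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = score_fn(model, sessions[i])
    return predictions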

Experiment setting
We examine the effectiveness of the system by setting up experiments under three conditions for comparison.
• ORA-T - empathy modeling on the manual transcripts of the therapist's language (i.e., using ORAcle Transcripts).

• ORA-D - ASR decoding of the therapist's language with manual labels of speech segmentation and speaker roles (i.e., using ORAcle Diarization and role labels), followed by empathy modeling on the decoded therapist language.

• AUTO - Fully automatic system that takes the audio recording as input and carries out all the processing steps in Sec. 2 and the empathy modeling in Sec. 3.
We set up three evaluation metrics for the performance of empathy code estimation: Pearson's correlation ρ and Root Mean Squared Error (RMSE) σ between expert-annotated empathy codes and system estimations, and the accuracy Acc of session-wise high vs. low empathy classification.
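These metrics can be computed as sketched below; the cut-off separating high from low empathy is a placeholder, and the sketch assumes the system estimates have been mapped onto the same scale as the expert codes.

import numpy as np
from scipy.stats import pearsonr

def evaluate(expert_codes, estimates, cutoff=4.0):
    expert = np.asarray(expert_codes, dtype=float)
    pred = np.asarray(estimates, dtype=float)
    rho = pearsonr(expert, pred)[0]                               # Pearson's correlation
    sigma = float(np.sqrt(np.mean((expert - pred) ** 2)))         # RMSE
    acc = float(np.mean((expert >= cutoff) == (pred >= cutoff)))  # high/low accuracy
    return rho, sigma, acc

# Toy example with four sessions.
print(evaluate([5.5, 2.0, 6.0, 3.0], [5.0, 2.5, 5.5, 3.5]))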

[...] The reported values are averages of session-wise values. We can see that the ASR-derived VAD information dramatically improves the diarization results of D_4 compared to D_2, which is based on the initial VAD.

[...] within turns. Therefore, inherent errors exist in the reference data, but we believe they should not affect the conclusions significantly, due to the relatively low rate of such events. [...] The classification accuracy Acc is given in percentage. [...] estimation of the empathy code. The performance for SP sessions is much higher than that for RP sessions. One reason might be that SP sessions are based on scripted situations (e.g., Child Protective Services takes a kid away from a mother who then comes to psychotherapy), while RP sessions are not scripted, and the topics tend to be diverse. Nevertheless, this does not mean that ASR errors have no effect on the performance. We found that the dynamic range of the predicted empathy scores is smaller and more centered in the ORA-D and AUTO cases, showing a reduced discriminative power.
There are some seemingly counter-intuitive results regarding Acc, e.g., in [...]. While the noisy text in the ORA-D case attenuates the representation of empathy, such an effect is less critical for binary classification, since it only concerns the polarity of high vs. low empathy rather than the actual degree. In Table 9, ORA-D has a slightly higher σ than ORA-T. This shows that ORA-D does not exceed [...]

Discussion

Empathy modeling strategies
In this section we discuss empathy and the modeling strategies in more detail. Empathy is not an individual property but is exhibited during interactions.
More specifically, empathy is expressed and perceived in a cycle [45]: (i) patient expression of experience, (ii) therapist empathy resonation, (iii) therapist expression of empathy, and (iv) patient perception of empathy. The real empathy construct is in (ii), while we rely on (iii) to approximate the perception of empathy by human coders. This suggests that one should model the therapist and patient jointly, as we have shown using acoustic and prosodic cues for empathy modeling in [21, 22].
However, joint modeling in the lexical domain may be very difficult, since patient language is unconstrained and highly variable, which leads to data sparsity. Therapist language, as in (iii) above, encodes empathy expression and hence provides the main source of information. Can et al. [46] proposed an approach to automatically identify a particular type of therapist talk style named reflection, which is closely linked to empathy. They showed that N-gram features of therapist language contributed much more than those of patient language. Therefore, in this initial work we focused on modeling therapist language, while in the future we plan to investigate effective ways of incorporating patient language.
Human annotation of empathy in this work is a session-level assessment, where coders evaluate the therapist's overall empathy level as a gestalt. In a long session of psychotherapy, the perceived therapist empathy may not be uniform across time, i.e., there may be influential events or even contradicting evidence.
Human coders are able to integrate such evidence toward an overall assessment.
In our work, since we do not have utterance-level labels, in the training phase we treat all utterances in high vs. low empathy sessions as representing high vs. low empathy, respectively. We expect the model to overcome this, since the N-grams manifesting high empathy may occur more often in high empathy sessions. In the testing phase, we found that scoring therapist language by utterance (and taking the average) exceeded directly scoring the complete set of therapist language. This demonstrates that the proposed methods are able to capture empathy at the utterance level. [...] to the class of the averaged code value. In Table 10 we list the counts of coder disagreement. We see that the rate of human agreement with the averaged code is around 90% on the CTT corpus. This suggests that human judgment of empathy is not always consistent, and the manual assessment of the therapist may not be perfect.

However, human agreement is still higher than the agreement between the averaged code and the automatic estimation (results in [...]). [...] The computational assessment, as an objective reference, may be useful for studying the subjective process of human judgment of empathy. Tables 11 and 12 show the bigrams and trigrams with extreme δ values, i.e., phrases strongly indicating high or low empathy. We see that highly empathic phrases often express reflective listening to the patient, while low empathy phrases often involve questioning or instructing the patient. This is consistent with the concept of empathy as "trying on the feeling" or "taking the perspective" of others.

Robustness of empathy modeling methods
In this section we demonstrate the robustness of the Lattice Rescoring method [...] respectively. For figure clarity, we display the mean and standard deviation for every 10 sample points (e.g., the first point represents the statistics of sampling indices 1, 6, ..., 46). Meanwhile, the asterisks show the performance when using the 1-best decoded paths.
In Fig. 3, [...] In practice, if the empathy LM is rich enough, one can also decode the utterance directly using the high/low empathy LMs instead of rescoring the lattice.

Conclusion
In this paper we have proposed a prototype of a fully automatic system to rate therapist empathy from language use in addiction counseling. We constructed speech processing modules that include VAD, diarization, and a large vocabulary continuous speech recognizer customized to the topic domain. We employed role-specific language models to identify the therapist's language. We then modeled high vs. low empathy therapist language using Maximum Entropy models, Maximum Likelihood language models, and lattice rescoring. For evaluation, we estimated empathy using manual transcripts, ASR decoding with manual segmentation, and fully automated ASR decoding. Experimental results showed that the fully automatic system achieved a correlation of 0.643 between human annotation and machine estimation of empathy codes, as well as an accuracy of 81% in classifying high vs. low empathy scores. Using manual transcripts we achieved a better performance of 0.721 correlation and 86% classification accuracy. The experimental results show the effectiveness of the system in therapist empathy estimation. We also observed that the performance of the three modeling methods is comparable in general, while the robustness varies across methods and conditions.
In the future, we would like to improve the underlying techniques for speech processing and speech transcription, such as implementing more accurate VAD, diarization with overlapped speech detection, and a more robust ASR system.
We would also like to acquire more and better training data, for example by using close-talking microphones during data collection. The use of close-talking microphones may fundamentally improve the accuracy of speaker diarization. As a result, acoustic and prosodic cues, which rely on robust speaker identification, may be integrated into the system. The system may be further augmented by incorporating other behavioral modalities, such as gestures and facial expressions from the visual channel. Joint modeling of these dynamic behavioral cues may provide a more accurate quantification of the therapist's empathy characteristics.
Appendix A. Note on data sharing

Restrictions apply to the release of the data corpora we used in the experiments, for two reasons. First, our work is a secondary study analyzing archived recordings of counseling sessions, which cannot be fully anonymized.
Thus the data cannot be released to the public. The only exception is the "General psychotherapy corpus", a collection of psychotherapy transcripts. We [...] Counseling and Psychotherapy Transcripts, Client Narratives, and Reference Works (http://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series). Second, all of the available original audio recordings were from third parties. The primary authors were not responsible for the collection of the original data, which was pulled from 6 different clinical trials. We list these specific studies and PI information as follows. [...] We would like to point out that, despite the constraints on the data, the methods proposed in this work and the system we described are generally applicable to empathy estimation in Motivational Interviewing. We expect the results to be reproducible on audio data similar in nature to the data in our experiments.