The Validation of Automated Social Skills Training in Members of the General Population Over 4 Weeks: Comparative Study

Background Social skills training by human trainers is a well-established method of teaching appropriate social and communication skills and strengthening social self-efficacy. Specifically, human social skills training is a fundamental approach to teaching and learning the rules of social interaction. However, it is cost-ineffective and offers low accessibility, since the number of professional trainers is limited. A conversational agent is a system that can communicate with a human being in a natural language. We proposed to overcome the limitations of current social skills training with conversational agents. Our system is capable of speech recognition, response selection, and speech synthesis and can also generate nonverbal behaviors. We developed a system that incorporated automated social skills training that completely adheres to the training model of Bellack et al through a conversational agent. Objective This study aimed to validate the training effect of a conversational agent–based social skills training system in members of the general population during a 4-week training session. We compare 2 groups (with and without training) and hypothesize that the trained group’s social skills will improve. Furthermore, this study sought to clarify the effect size for future larger-scale evaluations, including a much larger group of different social pathological phenomena. Methods For the experiment, 26 healthy Japanese participants were separated into 2 groups, where we hypothesized that group 1 (system trained) will make greater improvement than group 2 (nontrained). System training was done as a 4-week intervention where the participants visit the examination room every week. Each training session included social skills training with a conversational agent for 3 basic skills. We evaluated the training effect using questionnaires in pre- and posttraining evaluations. In addition to the questionnaires, we conducted a performance test that required the social cognition and expression of participants in new role-play scenarios. Blind ratings by third-party trainers were made by watching recorded role-play videos. A nonparametric Wilcoxson Rank Sum test was performed for each variable. Improvement between pre- and posttraining evaluations was used to compare the 2 groups. Moreover, we compared the statistical significance from the questionnaires and ratings between the 2 groups. Results Of the 26 recruited participants, 18 completed this experiment: 9 in group 1 and 9 in group 2. Those in group 1 achieved significant improvement in generalized self-efficacy (P=.02; effect size r=0.53). We also found a significant decrease in state anxiety presence (P=.04; r=0.49), measured by the State-Trait Anxiety Inventory (STAI). For ratings by third-party trainers, speech clarity was significantly strengthened in group 1 (P=.03; r=0.30). Conclusions Our findings reveal the usefulness of the automated social skills training after a 4-week training period. This study confirms a large effect size between groups on generalized self-efficacy, state anxiety presence, and speech clarity.


Introduction
Background Social skills training (SST) has been widely adopted to help people who lack social and communication skills. It is used in hospitals, employment support facilities, workplaces, schools, and other institutions. A human trainer generally conducts SST to promote appropriate social and communication skills, strengthen the individual's social self-efficacy, and reduce social anxiety [1,2]. The method of Bellack et al [3], or step-by-step SST, is a well-structured and widely used evidence-based approach inspired by the 5 core principles of social learning theory: modeling, shaping, reinforcement, overlearning, and generalization. This method defines the SST framework and its 4 basic skills: expressing positive feelings, listening to others, making requests, and declining requests. However, it is cost-ineffective because those who need to receive training must visit the place where the training is conducted (eg, hospitals and employment support facilities). Accessibility is further limited due to the low number of professional trainers, especially in rural areas. Our goal is to provide SST anywhere and anytime.
Other research groups, along with our efforts, have been conducting studies to automate SST using conversational agents [4,5] or robots [6], and these works have led to the development of automatic SST [7][8][9][10][11][12] through simulating human-led SST [13][14][15]. Among the various features of conversational agents, our SST system includes video modeling of human behavior, automatic real-time behavior recognition, and feedback. Our system's specialty is to follow the procedure of human SST [1]. We also collected human SST data and integrated a social skills prediction model into the automatic system.
Several aspects of such a system have been evaluated in terms of training effect by short-term single-group intervention [13], appearance [14], and social self-efficacy [15]. However, the long-term effect of such an SST system and comparison with a control group have not been evaluated. Aside from SST, several studies have investigated the training effect of such mental health measures [16][17][18]. Single-group analysis was also performed in the context of social cognition training systems [19]. Much of this research used questionnaires, such as the Patient Health Questionnaire-9 [20], and emotion recognition tasks [21] to provide an evaluation scale. However, no consistent improvement or enhanced skills have been measured by questionnaires [16,19].
As a preliminary step, this study proposed to validate our SST system in members of the general population over 4 weeks. We separated the participants into 2 groups (with and without training) and clarified the training effect through evaluations before and after the 4-week training period. We hypothesized that the trained group would demonstrate a greater change in their social skills compared with the nontrained group.

Automated SST
We built a fully automated SST system using the Greta platform [4] and a conversational agent named Rei (Figure 1). Our system is capable of speech recognition, response selection, and speech synthesis and can also generate facial expressions, gestures, and head nods. Nonverbal behaviors are generated in the specific commands embedded in the dialogue responses. This system works in real time as a Windows application. The conversation agent's appearance and sex can be changed. Previously, we designed anime-type female characters and investigated the acceptability and trustworthiness of their appearance [14]. This system is also applicable for use by healthy people, not only people with social disorders.
We created 4 tasks based on the basic SST model as well as scenarios for them: declining requests, listening to others, making requests, and expressing positive feelings. These selected tasks reflect the 4 basic tasks used in the Bellack method. Among these, declining a request is the most difficult [3]. When declining a request or listening to another person, the initiative is with the system that holds the floor. For the other 2 tasks, the initiative is on the learner's side with the system in the responsive position.
After a brief greeting, the conversational agent explains to the participants the importance of the training task. The system records the users' voices and images with a pin microphone and a webcam to assess user behavior. During role-play, the system perceives the user's utterances by speech recognition and responds based on keywords ("yes," "I understand," "I have," etc) prepared for it by rule-based interaction scenarios ( Figure  1, left). We used Google Cloud Services for the speech recognition and speech synthesis modules. If the keywords were not directly included in the speech recognition output, we used the Bidirectional Encoder Representations from Transformers (BERT) model [22] to calculate the cosine distance to the above keywords at the sentence unit and chose the closest keywords. We created 7 role-play variations for each of the 4 tasks by referring to the SST data in 1-on-1 and 1-on-2 situations conducted by psychiatrists and people with autism spectrum disorders or schizophrenia, as well as healthy controls who were previously recorded [23]. The role-play variations included the following topics: hospital, home, school, workplace, and friends. In this study, for the selected scenarios, we excluded the hospital situation since our participants were from healthy populations.
We constructed a score evaluator from the role-play videos and automatically predicted 7 items: eye contact, body orientation, facial expression, vocal variation, clarity, fluency, and social validity. Each of these was rated by our psychiatrist evaluators on a 5-point scale. We predicted the ratings based on user behavioral indicators using multimodal features (Praat [24], OpenFace [25], and OpenPose [26]) and BERT [22] similarity scores between the utterances spoken by the conversational agent and the users. Random forest predicted the scores using these features, and 2 psychiatrists rated the ground truth of these values. We previously reported our detailed prediction performance and the correlation coefficient between the ground truth, and the predicted values was a maximum of 0.53 [27]. Depending on the evaluation results, a radar chart, positive comments, and corrective comments were presented on a screen with video clips, and the comments were read aloud by the conversational agent (Figure 1, right). The radar chart shows the evaluation values.
In SST, role-play rehearsals by participants are always immediately followed by positive feedback on what specifically a person did well. Here, a genuinely positive aspect must be found in even the poorest role-play performance. Our SST system praises the learner and provides positive reinforcement. The conversational agent and feedback are displayed through digital signage, and feedback is provided in full-screen view after the calculation is finished. The detailed SST system description can be found in a previous work [28].

Methods
We followed the CONSORT (Consolidated Standards Of Reporting Trials) checklist to describe the method [29].

Trial Design
We conducted a nonrandomized comparative study in members of a healthy population for a preliminary evaluation of the SST system. The allocation was not concealed for either the participants or the examiner. Participants could choose the group they wanted to register with depending on their availability for the number of visits to the examination room. The participants were not informed about the tasks of the other group. The trainers and third-party raters did not know the group participants. We conducted the study in July and August 2022, with July 23 being the pretraining evaluation date and August 28 being the posttraining evaluation date.

Ethics Approval
This study was conducted in accordance with the Helsinki Declaration guidelines. Ethical approval was obtained from the Nara Institute of Science and Technology (Reference No. 2018-I-1), and we obtained informed consent from all participants.

Participants
We asked a human resources company to advertise and recruit the participants. The eligibility criteria included (1) aged between 20 and 35 years and (2) a male and female balance of about 50%. We also attempted to balance the ages and sexes between the 2 groups. Exclusion criteria included (1) a history of epilepsy; (2) history of neurological disease; (3) history of substance abuse; (4) history of addiction; (5) history of head injury; (6) history of treatment at psychiatric hospitals; and (7) history of cataracts, amblyopia, strabismus, nystagmus, or ptosis. This exclusion criterion was originally designed to be used for future comparisons with people with autism spectrum disorders and schizophrenia.
In all, 13 Japanese participants were recruited for group 1 and 13 for group 2. Because half of the participants will wear eye trackers, they were asked not to wear colored eye contacts. Since this study was a preliminary step to test the system, we followed previous works regarding sample size [13,16,19]. We selected between 10 and 13 participants because this was a feasible number in terms of staff operation for a 1-day data collection of pre-and posttraining evaluations. We did not inform the participants of the SST system details in advance. They were also asked to come to both the pre-and posttraining evaluations. Group 2 was not trained by the system, so they were asked to come only to the pre-and posttraining evaluation sessions. The participants were paid an honorarium according to their group.
In all, 8 participants were ultimately unable to complete the experiment: 3 participants in group 1 and 3 participants in group 2 did not come to the pretraining evaluation; 1 participant in group 1 was unable to complete the training due to an isolation for COVID-19; and finally, 1 participant in group 2 did not come to the posttraining evaluation.
Two trainers and third-party raters, who are psychiatrists with SST experience, participated in our study.

Interventions
We invited group 1 to visit Nara Institute of Science and Technology's examination room every week (each Tuesday or Thursday) to use the system. Each training session consisted of SST with the conversational agent for 3 different skills from the basic 4 skills. The duration of the training per person was a maximum of 30 minutes. The sessions continued for 4 weeks, so participants attended 4 training sessions. None of the role-play scenarios were repeated in the 4-week training. The system was launched by the examiners (first or second authors) and worked automatically. One examiner (first or second author) remained in the same room with the participants to observe whether the system worked properly and safely. However, this potentially affected the intimacy between the participant and the conversational agent.
To evaluate the generalizability of social skills other than basic tasks, we also conducted 3 role-play scenarios that were not included in the basic tasks of Bellack et al [3] based on the performance test's manual [39] that require social cognition and expression of the participants in new role-play scenarios. We conducted a practice session on starting a conversation and performing 3 role-plays for scenarios that require social cognition and expression: (1) understanding and expressing empathic behaviors, (2) self-disclosure, and (3) social problem-solving with one's mother. We used Tobii Pro Glasses (version 3) to measure participants' eye gaze during the role-play. Due to the limited availability of eye-tracker equipment, half of the participants in both groups wore a glass-type eye tracker.
Two psychiatrists with SST experience joined this study as role-playing interlocutors. We controlled the condition so that there was no difference in the trainers between groups and the evaluation stages. Due to COVID-19 concerns, a transparent partition was placed between the participants and trainers. A video camera was placed behind each conversationalist to record the other individual at chest level from the front.
The participants were asked to answer questions regarding social anxiety (no anxiety at all=1, somewhat anxious and a slight effect on performance=3, and strong anxiety and unable to perform=5) and social self-efficacy (good=1, neither=3, and bad=5) for the role-playing scenario after performing it. We also collected the evaluation data of the 2 interlocutor trainers by third-party evaluators. Since these ratings are almost all maximum, we omit an analysis of the trainers' evaluations in this paper.
For the participant evaluation, we obtained third-party ratings using a Likert scale from 0 to 5 for eye contact, body direction and distance, facial expression, vocal variation, clarity, fluency, and social appropriateness [39]. For this evaluation, the third-party raters watched the recorded videos from the front-view, and the raters were not informed as to whether these were pre-or posttraining sessions or group 1 or group 2 (randomized by a computer).
Clarity evaluates how clearly and logically the participant is trying to express what they want to say. Since the required skills depend on each situation, the social appropriateness differs depending on each SST task. Let us explain examples of social appropriateness for the basic tasks of Bellack et al [3]. The task of listening to others, which determines whether the participants paid attention to the interlocutor, includes nodding, back-channel responses, and other nonverbal behaviors (eg, eye contact and smiling). For the task of expressing positive feelings, social appropriateness involves expressing attention to the interlocutor's responses and the suitability of the participant's speech content. For the task of expressing positive feelings, social appropriateness assesses whether they explained the details of their request, including what kind of help they need. It also includes whether they listened to the interlocutor. For the declining task, social appropriateness is concerned with whether they expressed contrition and appropriate reasons for their refusal. It also includes whether they proposed alternatives to the requests (eg, "I'm sorry but I propose to do it next time"), which is an important act for the situation.
After confirming areas of agreement, we calculated the averages of the 2 raters. For this study, we set the primary outcome as the third-party ratings and the secondary outcome as the other questionnaires.

Statistical Analysis
We calculated the difference values between groups in terms of post-and pretraining evaluations. We used 1-tailed Wilcoxson Rank Sum tests while generally hypothesizing that group 1 would show larger improvement than group 2. We also reported the effect size r between groups. We analyzed the subscales of the questionnaires in addition to the total scores. To confirm the agreement of third-party ratings, we calculated the intraclass correlation coefficient using the 2-way random-effects model with a consistency-type analysis [40]. For the performance test on 3 role-plays, we concatenated the 3 role-play results to calculate the statistics since there was no significant difference between the role-play scenarios. For the statistical analysis, we used the R software (R Foundation for Statistical Computing) [41]. Specifically, we used the stringr and irr libraries and the Wilcoxson Rank Sum test function.

User Statistics
Of the 26 recruits, 18 completed this study. Group 1 had 6 male and 3 female participants, and the mean age was 27 Table 1 presents the results of the pretraining evaluation and the difference between the post-and pretraining evaluations from the questionnaires. Effect size r and P values are also reported in the pre-post values. Note that we did not confirm significant differences between the 2 groups at pretraining evaluation values (all P>.10). We can see that the General Self-Efficacy Scale was significantly improved in group 1 compared with group 2 (P=.02; r=0.53). We can also see that state anxiety presence was significantly weakened in group 1 compared with group 2 (P=.04; r=0. 49). Regarding the performance test, the intraclass correlation coefficients between the 2 raters were as follows: eye contact=0.86, body direction and distance=0.80, facial expression=0.89, vocal variation=0.84, clarity=0.80, fluency=0.88, and social appropriateness=0.78 (all P<.001), which show good agreement. Results of the performance test showed significant changes in clarity in group 1 compared with group 2 (P=.03; r=0. 30). Regarding the other scores, we found that some results were prone to significant features, but there were no significant differences between groups.

Principal Results
Principal results include group 1 showing significant improvement in social skills (speech clarity). Our findings demonstrate the usefulness of the study in a 4-week training program. Our study confirmed the large effect size of the different groups. General self-efficacy improved and state anxiety presence was reduced because successful SST basically improves participant self-efficacy and lowers anxiety.

Comparison With Prior Work
This is the first prototype of an SST system based on human SST. Aside from SST, several other works have investigated the training effect of such mental health measures [16][17][18]. Single-group analysis has also been performed in the context of social cognition training [19]. Most of this work used questionnaires and performance tests [20,21]. However, there was no consistent improvement or enhancement of skills, as observed through this work's questionnaires. Additionally, there is no consistent questionnaire and evaluation criteria for SST in past studies, which makes it difficult to compare with this study. Previous evaluations include learning new social skills; improving assertiveness; hospital discharges; relapse rates; and effects on stress reduction, quality of life, symptoms, and hospitalization [3].
Our research demonstrates the system's usefulness in achieving a training effect during 4 weeks of training and confirms a large effect size between groups for generalized self-efficacy, state anxiety presence, and speech clarity (r>0.30). Human trainers generally conduct SST to promote appropriate social and communication skills, strengthen an individual's social self-efficacy, and reduce social anxiety [1,2].
Our SST system praises the learner and provides positive reinforcement, affecting gains in self-efficacy [15,42]. It can be argued that our SST strengthens self-efficacy, leading to behavioral modification [43]. We found a reduction of state anxiety presence and anxiety for role-playing since role-playing with the conversational agent is safe and provides successful user experiences to reduce anxiety. Another study also found that if encouragement and success experiences are present, the user experiences an increased sense of trust, security, and safety and a reduction in tension, threat, and anxiety [44]. Our primary outcome found improved speech clarity. Our SST system includes feedback with a concrete suggestion of behaviors, for instance, "You should provide more concrete reasons for declining." This suggestion also leads to improvement in speech clarity. Previous work also reported an improved frequency of positive social behavior by human SST [45]. Heinssen et al [46] reported that participants learned, retained, and generalized new social skills through human SST.
In the future, the system needs to become more sophisticated in terms of modeling behaviors and the agent's naturalistic nonverbal behaviors, including dyad synchronization in turn-taking or theory of mind [47,48].

Limitations
This study is limited in its number of participants, which we need to increase in our future work to investigate a broader range of participants, including those with autism spectrum disorders and schizophrenia. This study was not a randomized controlled trial because participants were permitted to choose which group they wanted to register with. We found that some scores of participants in group 2 moved in the opposite direction. The negative effect of the first contact with the evaluation task on many social variables might have influenced participants, especially those in group 2, and the positive effects of training may be overestimated. Moreover, our participants are from healthy populations, so the baseline scores at the pretraining evaluation were already high. We also did not find a significant difference between trait-related measures, for example, trait anxiety and the Social Responsiveness Scale. A follow-up evaluation will also be needed to observe skills generalization and transformation over a long-term period. Furthermore, this study did not evaluate in terms of emotion recognition tasks [21]. By observing the history of the system's development, we confirmed that some participants' scores did not change between the beginning and end of the training, which might be related to the boredom of repetitive training. We should clarify the optimal training duration and number of role-play repetitions in the future. In addition, we should compare the SST system with human SST in terms of training effects in adults and children [49].

Conclusion
This study validated an SST system using a 4-week training program. The trained group demonstrated significantly improved self-efficacy and reduced anxiety. Furthermore, we confirmed the large effect size in terms of speech clarity. We plan to extend this study to include people with autism spectrum disorders and schizophrenia, and the system needs to be further elaborated. It may also be impactful to a much larger group with different social pathological phenomena. We are now developing a web-based SST system that could be used as a supplement to human SST.