1 Introduction

Inclusive education is founded on attending to individuals with diverse capabilities through well-designed tutoring procedures [1, 2]. In other words, inclusive education offers learners with differing characteristics the opportunity to be educated in an equitable context and to attain further learning achievements. Several recent studies in the field of education have revealed that the educational process benefits more from the relationship between educational assistive tools and learners than from the relationship between tutors and learners [3]. Hence, an important aspect of teaching is utilizing appropriate educational tools to improve students’ acquisition and engagement. Social robots are novel educational assistive technologies that support educators by promoting learning efficiency [4].

Delayed Speech Development (DSD) is a failure to develop communication skills at the expected stages of early childhood [5]. This disruptive delay is frequently accompanied by intellectual disability, which can affect toddlers’ socialization [6, 7]. Accordingly, these children face tremendous challenges in expressing their thoughts and comprehending information from their environment, leading them to be classified as socially vulnerable children. Statistics indicate that 8–10% of all preschoolers face DSD [8]. Ordinarily, the disorder’s initial symptoms appear around 18 months, when a child makes no effort to repeat the words they hear. At 24 months, their vocabulary is restricted to single words, and at 30–36 months, an apparent inability to form sentences can be observed. Generally, these children are only able to use memorized phrases picked up from games or animations [9]. Speech therapy sessions mitigate some of the problems related to language disorders and hone special needs children’s communication and verbal skills. In recent decades, employing audio and visual content during speech therapy sessions has become very popular among speech therapists due to its potential to increase the efficiency of the interventions [10,11,12]. However, these tools support only one-way interaction; in other words, children’s responses are not reciprocated in these approaches, so a conversation, which is a prerequisite to communication, cannot be formed [13].

Among the various assistive technologies utilized in therapy sessions, social robots have received growing attention in recent years due to their potential role as mediators between therapists and children [4, 14, 15]. Involving social robots in clinical settings increases participants’ attention, improves individuals’ social behaviors, and sustains their level of engagement during therapeutic interventions. The encouraging implications of employing social robots in therapy interventions for individuals with various impairments, such as Autism Spectrum Disorder (ASD) [16,17,18], Down syndrome [19, 20], and hearing impairments [21], have underscored the prospects of these promising assistive tools for providing equal educational opportunities for special needs children [22]. A noteworthy characteristic of employing a social robot as an assistive tool in therapy sessions is the two-way interaction formed between the robot and the child, which encourages different aspects of the children’s behavior, such as attention span and willingness to learn.

This research endeavors to explore the potential benefits of employing the RASA robot [23, 24] in speech therapy sessions via quantitative analysis. In this regard, two scenarios were carried out: the first compared the children’s engagement level in an imitation game played with the therapist and with the robot, and the second investigated the efficacy of the robot’s presence in speech therapy sessions. The children’s awareness of the robot’s capabilities to recognize its users’ facial expressions and express several emotional states forms a positive preconception about the robot’s intelligence level and the complexity of its behaviors, which helps to sustain the children’s engagement through long-term interaction with the robot [25]. Additionally, when children assess a social robot, the delight they have experienced through their interactions affects their acceptance of the robot. Enjoyment is a crucial element in the investigation of social robots’ acceptance; it diminishes individuals’ anxiety and makes them feel more confident about their ability to communicate with this technology [26,27,28]. Thus, the first scenario could benefit the second by increasing the children’s willingness to accept the social robot as an educational assistive tool in speech therapy sessions. To scrutinize the impacts of the robot’s presence in the two scenarios, two groups of children were recruited: the first (the intervention group) participated in robot-assisted speech therapy (RAST) sessions, and the second (the control group) participated in traditional therapy sessions. Each group comprised six children with language disorders (four males and two females) with a mean age of 6.4 years. To enable the robot to interact with the children in the RAST sessions, a facial expression recognition system and an accurate lip-syncing system were developed and implemented on the RASA robot. To this end, several well-established Convolutional Neural Network (CNN) architectures were trained on the AffectNet emotional database [29] and adapted via transfer learning on the CK+ database to better suit the RAST sessions. The other significant aspect of this study that distinguishes it from previous works [3, 4, 30] is the design of the robot’s mouth, which can precisely synchronize lip movement with the robot’s speech. In this way, children with language disorders can better learn the exact pronunciation of each word by concentrating on the robot’s lips.

The rest of the paper is organized as follows: Sect. 2 is devoted to elucidating related works. Section 3 explains the design of the affective interaction system, composed of three main phases: recognizing facial expressions, expressing various emotions, and implementing the system on the RASA robot. Section 4 describes the development of an appropriate human-like lip-syncing system for the robot. This section explains the design of the robot’s visual articulatory elements for each Persian phoneme in detail, along with the algorithm utilized to attain human-like lip-syncing. Section 5 gives details about the experimental procedure and discusses the two scenarios carried out in the study. The first scenario sought to answer a primarily exploratory question: Are children with language disorders more inclined to play a collaborative emotional imitation game with a social robot or with a therapist? The second scenario aimed to explore the effects of the robot’s presence in speech therapy interventions on the individuals’ language skills development and was assessed by comparing the progress of two groups of children with language disorders: those who participated in robot-assisted therapy sessions and those who took part in conventional sessions. Section 6 analyzes the results and discusses the outcomes via statistical tests. The assessment tools used in this analysis include video coding for the first scenario and the Persian version [31] of the Test of Language Development (TOLD) [32] for the second scenario. Section 7 discusses the limitations of this study and future work. Finally, the conclusion is drawn in Sect. 8. An overview of the study is shown in Fig. 1.

Fig. 1 An overview of the study

2 Related Work

2.1 Robot-Assisted Speech Therapy

According to studies investigating the potential of social robots in speech therapy interventions for children with different impairments, such as ASD [33], Cleft Lip and Palate (CL/P) [34], Cerebral Palsy (CP) [35], hearing impairments [36, 37], and DSD [38], the presence of a robotic assistant is beneficial in terms of providing incentives for children to participate in therapy sessions and improving their verbal and communication skills. For example, in Ref. [38], Zhanatkyzy and Turarova used the NAO robot to investigate the effectiveness of robot-assisted therapy (RAT) sessions. They conducted their experiments for two weeks (three sessions per week) with four DSD children between four and six years old. In this study, the robot played the role of entertainer by performing dances, playing games, and telling several fairy tales. The study results suggested that RAT could be regarded as a practical approach to encourage DSD children and facilitate their development in pronouncing simple sentences and singing well-known songs with the robot. However, the robotic platforms utilized in references [33,34,35,36,37,38] did not possess precise visual articulatory systems; consequently, the RAT scenarios conducted in these studies were based on auditory-verbal therapy (concerned with developing auditory and verbal skills) and were ineffective in enhancing children’s capabilities with regard to lip-reading and perceiving other non-verbal cues [37].

2.2 Facial Communication Channels in HRI

By and large, blurring the distinctions between therapists and socially assistive robots in terms of the communication methods used to interact with children could lead to progress in human–robot interaction (HRI). Moreover, real-time interaction between children and robots can positively affect both the learning process and social development [39, 40]. Thus, augmenting a socially assistive robot with human-like features, such as real-time recognition and expression of emotional states, body gestures, and lip-syncing, makes the robot more socially acceptable [41].

2.2.1 Facial Expression Recognition in HRI

Following advances in computer vision, developing emotional facial expression recognition systems for social robots via various machine learning algorithms, and thereby promoting the robots’ emotional intelligence, has been trending upward [42,43,44,45]. Since facial cues are essential elements of affective interaction, their recognition and expression lead to more in-depth communication [46,47,48,49,50,51]. Furthermore, the more expressive and complex the social robot’s behavior, the more it encourages children to remain engaged through long-term interactions with the robot [25]. Sociability, a primary factor in the acceptance of social robots, is attributed to users’ opinions about the robot’s social, emotional, and cognitive skills [28]. Hence, the robot’s capabilities in recognizing and expressing various emotional states influence the individuals’ evaluation of the robot’s intelligence level and heighten the robot’s acceptance. Different machine learning methods, e.g., deep learning, have been extensively used in the literature to promote social robots’ emotional intelligence [52]. Ref. [53] trained the Xception architecture [54] on the FER-2013 database [55] and implemented the trained model on the Pepper humanoid robot. In that study, the robot was able to recognize pedestrians’ emotions (neutral, happiness, sadness, and anger) and consider their emotional states to perform emotion-aware navigation while moving among them. In Ref. [47], the VGG16 network [56] was trained on the FERW database to develop a model capable of recognizing seven basic emotions; the trained model was then implemented on the XiaoBao robotic platform to improve the quality of the robot’s interactions.

2.2.2 Lip-Syncing in HRI

Lip-syncing is a key factor in human–human interactions, and its precise presentation can result in a better perception of a communicator’s purposes [57]. The visual components of the human articulatory system (lips, tongue, teeth, and jaw) and their motions convey the sounds generated by the vocal tract [58]. Due to the importance of multimodal communication in social robotics, many studies have focused on synchronizing lip movements with speech to take advantage of audio-visual information [59]. Cid and Manso [60] concluded that a robot’s verbal articulation could be improved by combining two sources of signals: auditory cues (pitch, pauses, and emphasis) and visual cues (lip motions, facial expressions, and body gestures). Ref. [61] found that possessing a dynamic and human-like mouth could increase the acceptance of the robot. Such a mouth is even more critical for a socially assistive robot in speech therapy interventions, where the ultimate goal is emulating natural speech.

3 Emotional Interaction System Design

3.1 Facial Expression Recognition System

As previously mentioned, social robots with the capacity to interact with children emotionally can substantially attract their attention [62]. Generally, emotional interaction comprises emotion recognition and expression, which can be conveyed through assorted audio and visual channels, including facial cues, a primary way of displaying feelings in human–human interactions. Among the various machine learning approaches, end-to-end neural networks are widely used for facial expression recognition tasks. The two principal aspects of developing a well-trained model for a recognition task are adopting proper databases and suitable architectures.

Developed by Mohammad H. Mahoor and his colleagues in 2017, AffectNet is one of the most comprehensive in-the-wild emotional datasets, comprising approximately one million web images [29]. The dataset consists of two main parts: manually annotated and automatically annotated images. The manually labeled images, the focus of this study, are classified into eight expression categories and three invalid categories: neutral, sad, happy, surprise, fear, anger, disgust, contempt, none, uncertain, and non-face. The invalid categories (none, uncertain, and non-face) were not considered in the training process of the current study. Due to the large number of annotated images and the in-the-wild character of the dataset, training an appropriate CNN architecture on it yields a well-trained model with superior generalization capability that can be used in real-world applications.

The extended Cohn-Kanade (CK+) dataset is another standard facial expression dataset, developed by Lucey et al. [63]. Although it contains only 327 sequences across 123 subjects, its images were gathered under controlled conditions and resemble the sequences captured by the RASA robot’s head-mounted camera in the laboratory. In this paper, similar to [64, 65], the last three frames of each labeled sequence were categorized as one of the basic emotions, and the first frame of each sequence was extracted as neutral. Table 1 summarizes the total number of images per expression for each dataset.

Table 1 The total number of images per expression in the AffectNet and CK+ datasets [29, 65]

Sample images from the AffectNet and CK+ datasets and an image captured by the robot’s camera in the lab environment are shown in Fig. 2.

Fig. 2 a An image captured by the robot’s camera in the experimental setup, b a sample image of the AffectNet dataset [29], and c a sample image of the CK+ dataset [63]

As the figure shows, the images captured by the robot’s camera and the CK+ images have two conspicuous similarities: both were captured in a straight-ahead position and in a standard lab environment.

In this study, the facial expression recognition system was designed and implemented on the robotic platform in three steps. In the first step, several noted architectures were trained on the AffectNet dataset, and the results were compared through various evaluation metrics, such as accuracy, F1-score [66], Cohen’s kappa [67], and area under the ROC curve (AUC) [68], to achieve an accurate model. In the second step, to enhance the system’s performance in interactions with the robot’s users in the laboratory, the model selected according to its performance on the AffectNet test set was adapted via transfer learning conducted on the CK+ dataset. Finally, the adapted model was implemented on the RASA robot.

3.1.1 Step One: Model Training

Several well-known CNN architectures with satisfactory performance on the ImageNet dataset [74], including MobileNet [69], MobileNet v2 [70], NASNet [71], DenseNet169 [72], DenseNet121 [72], Xception [54], Inception v3 [73], and VGG16 [56], were trained on the AffectNet dataset. Following the dataset’s instructions, faces were cropped and resized to 224 × 224, and the preprocessing corresponding to each network was then applied to the images. To achieve a better-generalized model, data augmentation was performed via three standard techniques: rotation (from −10 to 10 degrees), translation (up to 10% in both the x and y directions), and horizontal flipping. The Adam optimizer was used with a learning rate of 1e-5 and a momentum of 0.9. A weighted loss function was also used to compensate for the adverse effects of the imbalanced training set. The networks were trained over ten epochs. For each network, the maximum batch size was limited by the available hardware memory: 64 for MobileNet, MobileNet v2, and NASNet; 32 for DenseNet169 and DenseNet121; 16 for Xception and Inception v3; and 8 for VGG16. All the networks were trained on an NVIDIA GeForce GTX 1080Ti GPU using the Keras framework. Table 9, presented in the “Appendix” section, summarizes the accuracy of the trained models. A comparison of the networks’ accuracies led us to adopt the MobileNet architecture for the facial expression recognition task due to its small number of parameters and superior performance on the AffectNet test set. The confusion matrix of the MobileNet model is shown in Table 2.

Table 2 Confusion matrix of the trained MobileNet architecture on the AffectNet test set

Other evaluation metrics for the CNNs mentioned above are also concisely presented in Table 10 in the “Appendix” section. It is worth noting that the AffectNet dataset’s annotators concurred with each other on 60.7% of the images [29].
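The training configuration described above can be approximated with the following minimal Keras sketch. The directory layout, file paths, and the use of per-class weights as the “weighted loss” are illustrative assumptions rather than the authors’ exact code.

```python
# Minimal training sketch (assumed setup): MobileNet fine-tuned on AffectNet
# crops organized as affectnet/train/<class_name>/*.jpg -- paths are hypothetical.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 8  # neutral, happy, sad, surprise, fear, disgust, anger, contempt

# Augmentation as described: small rotations, 10% shifts, horizontal flips.
datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)
train_gen = datagen.flow_from_directory(
    "affectnet/train", target_size=(224, 224), batch_size=64, class_mode="categorical"
)

# ImageNet-pretrained backbone with a new 8-way softmax head.
base = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Class weights compensate for the imbalanced training set (one common way to
# realize a weighted loss; the paper's exact weighting scheme is not specified).
counts = np.bincount(train_gen.classes, minlength=NUM_CLASSES)
class_weight = {i: len(train_gen.classes) / (NUM_CLASSES * c)
                for i, c in enumerate(counts)}

model.fit(train_gen, epochs=10, class_weight=class_weight)
```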

3.1.2 Step Two: Model Adaptation

In this step, the MobileNet model chosen in the previous step was evaluated on the CK+ dataset and tuned via the transfer learning technique. To assess the network’s performance on the CK+, the dataset was split into train and test sets. The split was subject-based: 10% of the subjects, selected at random, formed the test set. Face detection for this dataset was performed with the Viola-Jones method [75], and the same preprocessing and augmentation techniques were applied as explained above. While the features extracted in the early layers of a CNN model are more generic, those extracted from later layers are more dataset-specific. Hence, to adapt the model to the CK+, the parameters of the 20 earliest layers were frozen, and the remaining layers’ parameters were tuned over ten epochs. Table 3 presents the accuracy and the evaluation metrics of the MobileNet model on the CK+ test set before and after performing transfer learning on the CK+ train set.

Table 3 A comparison of the MobileNet evaluation metrics on the CK+ test set, before and after transfer learning

As Table 3 shows, transfer learning improved the model’s performance on the CK+ test set. Given the similarity between the study’s experimental environment and the CK+ dataset, we could reasonably expect a more precise facial expression recognition system after the tuning.
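A minimal sketch of this adaptation step is given below; the checkpoint name and the CK+ directory layout are hypothetical, used only to illustrate the layer-freezing strategy described above.

```python
# Adaptation sketch (hypothetical paths and layout): freeze the 20 earliest layers
# of the AffectNet-trained MobileNet and fine-tune the remaining layers on CK+.
import tensorflow as tf
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = tf.keras.models.load_model("mobilenet_affectnet.h5")  # hypothetical checkpoint

for layer in model.layers[:20]:     # early layers: generic features, kept frozen
    layer.trainable = False
for layer in model.layers[20:]:     # later layers: dataset-specific, fine-tuned
    layer.trainable = True

# Recompile so the new trainable flags take effect.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                             rotation_range=10, width_shift_range=0.1,
                             height_shift_range=0.1, horizontal_flip=True)
ck_train = datagen.flow_from_directory("ck_plus/train", target_size=(224, 224),
                                       batch_size=32, class_mode="categorical")
model.fit(ck_train, epochs=10)
```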

3.1.3 Step Three: Implementing the Facial Expression Recognition System on the RASA Robot

The humanoid robotic platform utilized in the study was RASA, designed and manufactured at CEDRA (Center of Excellence in Design, Robotics, and Automation) at the Sharif University of Technology [23, 24]. This socially assistive robot aims to interact with special needs children. Figure 3 displays the employed robotic platform.

Fig. 3 The RASA socially assistive robot

The robot’s ability to perform real-time recognition and react authentically is a critical factor in providing natural interaction. Given the limited power of the graphics processing unit in the robot’s onboard computer, it was beneficial to offload the computational cost of the facial expression recognition task; accordingly, an external NVIDIA GeForce RTX 2070 GPU was deployed for the graphical computations. To implement the developed emotional system on RASA, the robot’s onboard camera first captured the user’s image. Next, a ROS node was used to stream the image topic. A Python script was then developed to capture the live video stream from the robot’s IP address and apply the Viola-Jones face detection algorithm [75] to the received data. Following face detection, the CNN model was used to predict the user’s facial expression. In response, the appropriate reaction, according to the HRI scenario, was selected and published on a ROS topic. Ultimately, the robot reacted according to the subscribed message. In this scheme, only the tasks of streaming the video and subscribing to messages were loaded onto the robot’s onboard computer.
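The recognition loop running on the external GPU machine could look roughly like the following sketch. The topic name, stream URL, model file, emotion order, and emotion-to-reaction mapping are illustrative assumptions, not the authors’ implementation.

```python
#!/usr/bin/env python
# Sketch of the recognition loop on the external GPU machine (names are illustrative):
# grab frames from the robot's camera stream, detect the face (Viola-Jones cascade),
# classify the expression with the tuned CNN, and publish a reaction for the robot.
import cv2
import numpy as np
import rospy
import tensorflow as tf
from std_msgs.msg import String
from tensorflow.keras.applications.mobilenet import preprocess_input

# Class order assumed to match training; adjust to the actual label mapping.
EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger", "contempt"]

rospy.init_node("facial_expression_recognizer")
reaction_pub = rospy.Publisher("/rasa/reaction", String, queue_size=1)  # hypothetical topic

model = tf.keras.models.load_model("mobilenet_ck_tuned.h5")             # hypothetical path
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

stream = cv2.VideoCapture("http://rasa.local:8080/stream")              # hypothetical URL

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    ok, frame = stream.read()
    if not ok:
        continue
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        face = cv2.resize(frame[y:y + h, x:x + w], (224, 224))
        face = preprocess_input(face[np.newaxis].astype("float32"))
        emotion = EMOTIONS[int(np.argmax(model.predict(face, verbose=0)))]
        # The HRI scenario maps the recognized emotion to a robot behavior.
        reaction_pub.publish(String(data=emotion))
    rate.sleep()
```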

3.2 Facial Expressions

To achieve a two-way interaction between the robot and a child, it is essential not only to recognize the child’s emotional state but also for the robot to display an appropriate expression in return. Therefore, designing appealing facial expressions for the robot is crucial. In the current study’s speech therapy scenarios, the robot should be able to convey emotional messages and enunciate letters and words simultaneously, so its emotional expressions should not depend only on the articulatory visual elements. Hence, several other components, such as the eyes, eyebrows, and cheeks, were also considered in the design of the robot’s emotional states. In this way, the robot can express emotions and lip-sync concurrently.

Figure 4 depicts the eight emotional states designed for the robot’s face.

Fig. 4 Designed emotional states of the robot

4 Lip-Syncing System

4.1 Graphic Design

Developing a lip-syncing system with realistic articulators could boost the robot’s efficacy in the RAST sessions. To achieve a perceptible visual articulatory system, an Iranian sign language interpreter was hired to pronounce the Persian phonemes (vowels and consonants), and the articulators were carefully sketched based on images of him captured in a straight-ahead position. Figure 5 illustrates the procedure of sketching the robot’s visual articulatory elements for a particular phoneme.

Fig. 5 The procedure of designing the robot’s articulators for a particular phoneme

Figure 6 shows the individual shapes sketched for Persian phonemes, including twenty-two consonants and six vowels.

Fig. 6 The sketches of the articulatory elements for the Persian phonemes and the normal state

4.2 Morphing Algorithm

The proposed lip-syncing algorithm follows a three-step process: receiving a word, decomposing it into its basic phonemes, and smoothly morphing each phoneme’s shape into the next. In the procedure of morphing a phoneme into its subsequent one, the deformation of the articulators should be minimized to achieve a natural visual articulation. Furthermore, although minor defects in the sketches could be ignored by observers, discrete and unnatural transitions of elements are not permissible. Moving between the initial and final points at constant velocity throughout the transition adversely affects the fluidity of the movement and leads to unnatural motion [76]; this problem can be addressed by adding acceleration terms [77].

In animation jargon, an easing function describes the way that the transition from an initial point to a final point occurs by determining the velocity and acceleration terms. Figure 7 demonstrates some well-established easing functions.

Fig. 7 Penner’s easing functions [76]

After examining the easing functions presented in Fig. 7, the “InOutExpo” function was selected due to its capability to provide a natural and smooth transformation. The equation of this function is given by [76]:

$$ y = \begin{cases} 0 & x = 0 \\ 2^{20x - 11} & x \in \left( 0, \tfrac{1}{2} \right] \\ 1 - 2^{-20x + 9} & x \in \left( \tfrac{1}{2}, 1 \right) \\ 1 & x = 1 \end{cases} $$
(1)
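For illustration, Eq. (1) can be transcribed directly into a short Python function (a sketch, not the robot’s implementation), with a check that the two exponential branches meet at x = 0.5:

```python
def ease_in_out_expo(x: float) -> float:
    """Penner's "InOutExpo" easing, Eq. (1): maps normalized time x in [0, 1]
    to normalized progress y in [0, 1] with a slow start and a slow end."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x <= 0.5:
        return 2.0 ** (20.0 * x - 11.0)
    return 1.0 - 2.0 ** (-20.0 * x + 9.0)

# Both branches give 0.5 at the midpoint: 2**(-1) = 1 - 2**(-1) = 0.5
assert abs(ease_in_out_expo(0.5) - 0.5) < 1e-12
```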

The smooth transition from a particular articulatory element of a phoneme to its corresponding component in another phoneme within successive frames could be accomplished by dividing each shape into numerous points and employing an easing function to determine the points’ transition characteristics. The two shapes’ corresponding points are determined by minimizing the deformation according to the following penalty function:

$$ J = \sqrt{ \frac{ \sum_{i = 1}^{N} \left( x_{i} - \hat{x}_{i} \right)^{2} }{N} } $$
(2)

Here, \(x_{i}\) and \(\hat{x}_{i}\) denote the positions of the \(i\)-th corresponding points on the two shapes, and \(N\) is the number of points; the correspondence problem thus reduces to minimizing the least-squares penalty function in Eq. (2). Increasing the number of points makes the transition smoother but incurs more computational cost. In this study, the KUTE library was deployed to morph the articulators’ vector shapes, which were sketched in Adobe Illustrator. The sign language interpreter and the speech therapist visually assessed the quality of the proposed articulatory system. Figure 8 demonstrates the procedure of lip-syncing a Persian word.

Fig. 8 The procedure of lip-syncing (/ɒːb/), which means water in Persian
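The point-correspondence and interpolation procedure described above can be sketched as follows. This is an illustrative NumPy sketch, not the KUTE-based implementation: contours are assumed to be equally sampled closed polygons, correspondence is chosen as the cyclic shift minimizing the penalty of Eq. (2), and intermediate frames use the easing of Eq. (1).

```python
# Illustrative sketch: match points of two equally sampled closed contours by
# minimizing Eq. (2) over cyclic shifts, then interpolate between matched points
# using the InOutExpo easing of Eq. (1). Not the production (KUTE/SVG) code.
import numpy as np

def ease_in_out_expo(x):
    x = np.clip(x, 0.0, 1.0)
    return np.where(x <= 0.5,
                    np.where(x == 0.0, 0.0, 2.0 ** (20.0 * x - 11.0)),
                    np.where(x == 1.0, 1.0, 1.0 - 2.0 ** (-20.0 * x + 9.0)))

def match_points(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Cyclically shift dst (N x 2 points) so its RMS distance J to src is minimal."""
    penalties = [np.sqrt(np.mean(np.sum((src - np.roll(dst, k, axis=0)) ** 2, axis=1)))
                 for k in range(len(dst))]
    return np.roll(dst, int(np.argmin(penalties)), axis=0)

def morph_frames(src: np.ndarray, dst: np.ndarray, n_frames: int = 25):
    """Generate intermediate contours from src to dst for one phoneme transition."""
    dst = match_points(src, dst)
    for t in np.linspace(0.0, 1.0, n_frames):
        y = ease_in_out_expo(t)          # eased progress instead of constant velocity
        yield (1.0 - y) * src + y * dst
```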

5 Methodology

In this study, two groups of children with language disorders were investigated to assess the efficacy of utilizing social robots as assistive tools in speech therapy. The first group (the intervention group) was enrolled in the RAST interventions, while the second group (the control group) participated in conventional speech therapy sessions. Both groups underwent two scenarios: a ten-minute imitation game and a set of thirty-minute speech therapy sessions.

The first scenario was designed to scrutinize the children’s engagement level during a mimicry game played with the robot (for the intervention group) or the therapist (for the control group). This examination compared the duration participants spent looking at the therapist/robot, using a manual video coding technique. The second scenario was formulated to examine the potential of deploying social robots in speech therapy sessions for children with language disorders. It investigated the language skills progression of individuals who participated in the RAST sessions and in the traditional therapeutic interventions. To accomplish this objective, the Persian TOLD was administered to the intervention and control groups in two phases: a pre-test and a post-test. The pre-intervention scores were utilized to assess the comparability of the two groups’ initial language levels, and the post-intervention scores were used to investigate the efficacy of the RAST sessions.

Each group participated in ten one-to-one therapy sessions, one session per week. The first session (Week #1) was devoted to familiarizing the children with the experimental setup (to negate the novelty factor effects) and administering the pre-test. The second session (Week #2) was dedicated to performing the first scenario, the imitation game, and the remaining sessions were held to explore the social robot’s potential in speech therapy. One week after the final speech therapy session, the post-test was given to both groups of children.

5.1 Participants

In this exploratory study, the intervention group consisted of six native Persian-speaking children with language disorders (two female, four male) with an average age of 6.4 years and a standard deviation of 2.2 months. The control group consisted of the same number of native Persian-speaking children with the same gender distribution and an average age of 6.4 years with a standard deviation of 1.9 months. To make a precise evaluation, both groups of children were selected from individuals who attended weekly traditional speech therapy sessions at the Mahan Language Disorders Center. Furthermore, they were asked not to participate in any other therapy sessions from two weeks before the start of our investigation until the end of it. According to a post-hoc power analysis conducted with G*Power 3.1 [78], for a sample size of N = 6 per group, the power of this pilot study is 12%, assuming a medium effect size of 0.5 and a significance level of 0.05.
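The reported power can be reproduced approximately with a post-hoc computation such as the following sketch, which uses statsmodels in place of G*Power:

```python
# Post-hoc power check (sketch): two independent groups of n = 6, medium effect
# size d = 0.5, two-sided alpha = 0.05. statsmodels is used here instead of G*Power.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5, nobs1=6, alpha=0.05,
                              ratio=1.0, alternative="two-sided")
print(f"Achieved power: {power:.2f}")   # roughly 0.12, i.e., about 12%
```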

5.2 Experimental Setup

The RAST sessions were conducted at the Social and Cognitive Robotics Lab at the Sharif University of Technology. Three cameras, two located in the room’s corners and one mounted on the robot’s head, recorded all interventions. The speech therapist was present beside the robot in all sessions. In another room (the control room), the robot’s operators controlled and narrated the robot’s dialogues, synchronizing their voice with the robot’s lip-syncing through the video feed they received from the RAST sessions. Hence, a real-time human voice was combined synchronously with the robot’s articulation to communicate with the children. Two speakers were also placed in the room to play the filtered operator’s voice, which was made to sound like a child by changing its pitch. The schematic of the experimental setup, including the intervention and control rooms, is shown in Fig. 9.

Fig. 9 The schematic of the intervention and control rooms

The conventional speech therapy sessions of the second group were also held in the same experimental setup, but without the social robot, to eliminate the impact of environmental conditions.

5.3 Intervention Scenarios

Speech therapy aims to hone individuals’ communication skills and enhance the participants’ abilities to grasp and express thoughts, ideas, and feelings. Therefore, engaging children and boosting their achievements from therapy sessions are two pre-eminent factors that should be considered in the interventions.

5.3.1 Scenario One: The Investigation of the Children’s Engagement Level via an Imitation Game

The current scenario was designed to explore whether children with language disorders are more engaged when interacting with the social robot than with the therapist. Scenario one was a facial expression mimicry game that required children to stand in front of a playmate (robot/therapist) and imitate its facial expressions. The scenario was conducted in a ten-minute intervention in which the child’s playmate displayed a random facial expression and waited until the child imitated the same emotional state. The playmate expressed the next emotion once the imitation was performed correctly. The intervention group played with the robot throughout the scenario, while the control group played with the therapist.

5.3.2 Scenario Two: Utilizing a Social Robot for the Therapy of Children with Language Disorders

In this scenario, two sets of thirty-minute speech therapy sessions were carried out for the two groups of participants. The intervention group attended RAST sessions, while the control group participated in conventional speech therapy interventions. During the RAST sessions, the RASA robot interacted with the children in various ways, i.e., teaching the correct pronunciation of words via lip-syncing, providing a reward-and-punishment system by expressing different emotional states, asking questions, and guiding the children toward proper answers to the therapist’s questions. Five frequent tasks (extracted from relevant studies [79,80,81,82,83]) were performed in each speech therapy session for both groups of participants to facilitate the children’s oral language development. Table 4 describes the list of the activities conducted in the therapeutic interventions.

Table 4 The list of activities performed in the speech therapy sessions

Figure 10 displays the therapist and the robot in a RAST session.

Fig. 10 A robot-assisted speech therapy session

5.4 Assessment Tools

5.4.1 Scenario One: Assessment of Children’s Engagement Level via an Imitation Game

In the context of HRI, content analysis of the interventions’ recorded videos has been extensively employed to probe individuals’ behavioral patterns [84,85,86]. Meanwhile, analyzing gaze data (the frequency and duration of gazes) provides metrics quantifying individuals’ engagement throughout human–human and human–robot interactions [87,88,89]. Accordingly, in the first scenario, the participants’ engagement was evaluated by manually video-coding the children’s gaze behavior in the recordings of the therapy sessions. The video coding was performed by two raters separately according to the following procedure.

First, because the participants’ attention fluctuated during the interventions, the game’s duration was segmented into equal spans (\(\Delta t = 20\,\mathrm{s}\)). Second, within each span, the raw score was defined as the fraction of time the child spent gazing at the playmate (either the robot or the therapist). Next, the mean score of each span was calculated by averaging the two coders’ raw scores. Finally, each participant’s engagement score was computed by integrating the mean scores over the intervention’s duration and dividing by the session length. The Pearson correlation coefficient between the two raters’ raw scores was calculated to determine the inter-rater reliability of the results.
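The scoring procedure can be illustrated with the following sketch, using hypothetical rater data; with equal spans, the integral divided by the session length reduces to the time-average of the per-span mean scores.

```python
# Engagement scoring sketch with hypothetical rater data: raw scores are the
# fraction of each 20-s span spent gazing at the playmate (robot or therapist).
import numpy as np
from scipy.stats import pearsonr

SPAN = 20.0  # seconds per coding interval

rater_a = np.array([0.80, 0.65, 0.90, 0.70, 0.85])  # hypothetical raw scores
rater_b = np.array([0.75, 0.70, 0.85, 0.65, 0.90])

mean_scores = (rater_a + rater_b) / 2.0

# Integral of the mean scores over the session divided by its length; with
# equal spans this is simply the average of the per-span mean scores.
session_length = SPAN * len(mean_scores)
engagement = np.sum(mean_scores * SPAN) / session_length

# Inter-rater reliability across all coded spans.
r, _ = pearsonr(rater_a, rater_b)
print(f"engagement = {engagement:.2f}, inter-rater r = {r:.2f}")
```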

5.4.2 Scenario Two: Assessment of the RASA Social Robot’s Utility in Speech Therapy for Children with Language Disorders

In the second scenario, the Persian version of the TOLD was used to evaluate the impact of the robot on the children’s language development. This questionnaire is a certified tool for evaluating preschoolers’ language abilities in six core and three supplemental subsets. The test’s subsets are summarized in Table 5.

Table 5 The TOLD subsets’ descriptions [31, 32]

A speech therapist was hired to hold the therapy interventions and score the Persian TOLD questionnaire. Following the test’s scoring instructions, the therapist asked each child several items to rate the test subsets. If the participant answered an item correctly, they received a score of one; otherwise, they received zero. Thus, the number of items the children answered correctly in each subset’s examination determined their respective raw scores. To eliminate the potential impact of the children’s age on the assessment of their language development, the TOLD provides age-based norm tables for converting raw scores into scaled scores ranging from 0 to 20. In the current study, the normalized scaled scores (the scaled scores divided by 20) were adopted as metrics to compare the participants’ oral language enhancement in the speech therapy scenarios. By combining the subsets’ scores, composite scores were calculated that reflect the children’s development in the primary facets of language, including listening, organizing, speaking, semantics, grammar, phonology, and overall language ability. The score of each language dimension was obtained by summing the scaled scores of its associated subsets and normalizing by the number of contributing subsets, yielding a value between zero and one [31]. Table 6 demonstrates the association of the TOLD subsets with the primary language skills.

Table 6 The association between the TOLD subsets and primary language skills [32]
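The composite-score normalization can be sketched as follows; the scaled scores and the subset-to-skill mapping shown here are placeholders for illustration only, since the real association is the one given in Table 6.

```python
# Composite-score sketch with hypothetical data: scaled subset scores (0-20) are
# summed per skill and normalized so every composite lies between zero and one.
# The subset-to-skill mapping below is illustrative; Table 6 gives the real one.
scaled_scores = {"subset_A": 9, "subset_B": 12, "subset_C": 7}    # hypothetical

skill_map = {"listening": ["subset_A", "subset_B"],               # illustrative mapping
             "overall":   ["subset_A", "subset_B", "subset_C"]}

def composite(skill: str) -> float:
    subsets = skill_map[skill]
    return sum(scaled_scores[s] for s in subsets) / (20.0 * len(subsets))

print({skill: round(composite(skill), 2) for skill in skill_map})
```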

6 Results and Discussion

In the explained scenarios, the scores of both groups (intervention and control groups) were separately evaluated by the proposed assessment tools and then examined via statistical analysis using Minitab software. The p-values of the tests were employed to identify any significant differences between the intervention and control groups.

6.1 Content Analysis of the Recorded Videos

Figures 11 and 12 present the scores of each child’s engagement for the intervention and control groups, respectively. Table 7 also encapsulates the average and standard deviation of the two groups’ engagement scores and the statistical analysis results.

Fig. 11 The scores of the children’s engagement in the intervention group

Fig. 12 The scores of the children’s engagement in the control group

Table 7 The engagement scores of the intervention and control groups

As Table 7 demonstrates, the results of the t-test indicate that the engagement scores of the intervention group are significantly higher than those of the control group (p = 0.025 < 0.05). Descriptive statistics show that the intervention group members, on average, gazed at the robot for 11.7 s longer than the control group participants looked at the therapist. Furthermore, the large Cohen’s d effect size (1.66 > 0.8) indicates that the children showed a stronger inclination to play with the social robot than with the therapist. According to the Pearson correlation coefficient, the two raters’ raw scores were strongly correlated (r = 0.72 > 0.7).

6.2 TOLD Analysis

This study quantified the participants’ language skills and overall language abilities through the scoring system proposed by the Persian TOLD questionnaire. The groups’ pre-test scores were statistically analyzed to explore the comparability of the intervention and control groups in terms of their initial language levels. Furthermore, the implications of the robot’s presence for the participants’ language skills improvement were assessed by computing their progress scores, defined as the children’s post-test scores minus their pre-test scores. The normalized results of these measures and their corresponding statistical analysis are summarized in Table 8. A Bonferroni correction (\(\alpha' = \alpha/k\), where k is the number of tests; here k = 9, the number of TOLD subsets) was applied to the pairwise comparisons to control the Type I error rate.

Table 8 Comparison of the two groups’ language development metrics

According to Table 8, p-values related to the administered pre-test highlight no significant differences (p > 0.05) between the intervention and control groups regarding the initial levels of language development metrics; thus, these groups can be considered comparable. The Bonferroni post-hoc tests indicate that only the scores of the “Word Discrimination” subset were significantly different between the intervention and control groups (p < 0.005).

As previously mentioned, the scores of the primary language skills can be evaluated by summing and normalizing the scores of the associated subsets, as explained in Table 6. Table 8 reveals that the two groups’ primary language skills scores in the pre-test were not significantly different, which means that the initial states of the groups were comparable. According to the Bonferroni post-hoc tests, the intervention group made significantly more progress in the primary language skills than the control group. Furthermore, the overall language ability score, calculated by summing and normalizing the nine subsets’ scores of the TOLD, represents the children’s total language development. The analysis of this metric shows that the overall language ability of the children who interacted with the robot improved significantly more than that of the children who took part in conventional speech therapy sessions. The results of this preliminary exploratory investigation shed light on the encouraging implications of utilizing social robots in speech therapy sessions, which is in agreement with the results of Ref. [4]. However, the limited number of study participants prevents us from making a generalized claim about the robot’s efficacy in interactions with other children with language disorders.
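The statistical comparisons reported in this section can be sketched as follows with hypothetical progress scores; the study itself used Minitab, and scipy/numpy are used here purely for illustration.

```python
# Statistical-comparison sketch with hypothetical progress scores: independent
# two-sample t-test, Cohen's d effect size, and the Bonferroni-corrected threshold.
import numpy as np
from scipy import stats

intervention = np.array([0.20, 0.15, 0.25, 0.18, 0.22, 0.17])  # hypothetical
control      = np.array([0.10, 0.08, 0.12, 0.09, 0.11, 0.07])  # hypothetical

t_stat, p_value = stats.ttest_ind(intervention, control)

# Cohen's d with the pooled standard deviation.
n1, n2 = len(intervention), len(control)
pooled_sd = np.sqrt(((n1 - 1) * intervention.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (intervention.mean() - control.mean()) / pooled_sd

# Bonferroni-corrected significance threshold for the nine TOLD subsets.
alpha_corrected = 0.05 / 9

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}, "
      f"significant after correction: {p_value < alpha_corrected}")
```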

7 Limitations and Future Work

The COVID-19 pandemic limited the number of families willing to collaborate with our research group, which resulted in the small number of study participants. This was a serious limitation that left the study underpowered, as shown by the power analysis. Another limitation was that the authors had no control over the families of the special needs children and could not, in good conscience, deprive them of therapy for a longer span before the examination began; consequently, the possible influence of previously experienced therapy sessions on the current therapeutic interventions could not be fully separated out. The temporary displeasure of a few children in a limited number of training sessions was a further limitation. Although the training sessions were based on one-to-one interaction, a few children would have refused to participate in some sessions if a family member or companion had not been present at the beginning; this could introduce a social bias into the study’s results, which was unavoidable. Finally, owing to the lack of similar studies on the utility of social robots in speech therapy interventions, it was difficult to compare the outcomes of this research comprehensively with others.

In our future work, we will increase the number of participants and consider the subjects’ gender as an independent variable to see whether the current findings generalize to RAST sessions for children with language disorders. Moreover, to encourage children to participate in the RAST sessions, they were initially engaged with the robot via an imitation game. However, the influences of the gaming scenario and the robot’s augmented features, including the facial expression recognition and lip-syncing capabilities, on the therapeutic interventions were not explicitly investigated. Thus, further inspection is required to rigorously assess whether the children’s language progress is attributable only to the robot’s presence or also to the capabilities implemented on the robot. Additionally, the novelty of the robot could have affected the outcomes of the two scenarios. Although the first week of the examination was dedicated to introducing the robot to the children, quantitative investigation would be beneficial for negating the novelty factor’s impact.

8 Conclusion

This paper addressed the potential benefits of employing a socially assistive robot in speech therapy interventions. The main focus of the study was to evaluate the robot’s capacity to engage children with language disorders and enhance their learning achievements. To attain the interventions’ objectives, two capabilities, facial expression recognition and lip-syncing, were developed for the employed robotic platform, the RASA social robot. The facial expression recognition model was obtained by training various well-known CNNs on the AffectNet database and adapting the best-performing model via transfer learning on the CK+ dataset to enhance the system’s performance in the robot’s environment. The lip-syncing capability was developed by designing and implementing an articulatory system on the robot that imitates human articulation. The study’s results, acquired through video coding, the Persian TOLD, and statistical analysis, revealed the prospects of using the RASA robot in speech therapy sessions for children with language disorders. However, one should avoid expecting considerable improvements; the reported findings should be regarded as preliminary exploratory results and interpreted with caution, since the small number of subjects limits the investigation, as shown by the power analysis.