Comparing technologies for conveying emotions through realistic avatars in virtual reality-based metaverse experiences

With the development of metaverse(s), industry and academia are searching for the best ways to represent users' avatars in shared virtual environments (VEs), where real-time communication between users is required. The expressiveness of avatars is crucial for transmitting emotions, which are key for social presence and user experience and are conveyed via verbal and non-verbal facial and body signals. In this paper, two real-time modalities for conveying expressions in virtual reality (VR) via realistic, full-body avatars are compared by means of a user study. The first modality uses dedicated hardware (i.e., eye and facial trackers) to map the user's facial expressions and eye movements onto the avatar model. The second modality relies on an algorithm that, starting from an audio clip, approximates the facial motion by generating plausible lip and eye movements. For both modalities, the participants were requested to observe the avatar of an actor performing six scenes involving as many basic emotions. The evaluation mainly considered social presence and emotion conveyance. Results showed a clear superiority of facial tracking over lip sync in conveying sadness and disgust. The superiority was less evident for happiness and fear, and no differences were observed for anger and surprise.

Avatars can differ in various aspects. They can be realistic 4 or cartoon-looking. 5 They can also differ in terms of complexity, 6 ranging from ergonomic and minimalist visualizations 7 in which only the user's head and hands are shown, up to configurations much closer to reality, with complete figures from head to toes. 8 The latter may also be provided with facial expressions. 2 Research on avatars and their representation, though, is still at an early stage. In particular, there are only a few studies aimed at measuring the impact of the representation of an avatar from a psycho-sociological and perceptual point of view, especially in XR applications. 9 Indeed, some studies showed that users prefer avatars populating VEs to be realistic, full-body, and endowed with facial expressions. 2,8,10 Capturing facial expressions, however, is still challenging when using, for example, a Head-Mounted Display (HMD) for VR, although devices supporting this functionality are appearing on the market given the importance of these social cues for emotion conveyance. 11 An alternative way to reproduce facial expressions in these situations is to use lip synchronization (lip sync), that is, synchronized animations on a speaking avatar. 12 Building on these premises, this work compares the above modalities for conveying emotions through realistic full-body avatars in a multi-user VR-based scenario. The two modalities, implemented with currently available consumer-level hardware (HTC Vive Facial Tracker for the first one) and software (SALSA Lip Sync Suite v2 for the second one), have been evaluated by means of a user study involving 28 participants. The participants were asked to wear the HMD and observe an avatar in a VE playing six scenes, each representing one of Ekman's basic emotions. 13 Each scene was played twice (once per modality) with a pseudo-randomized order of exposition.
In this way, it was possible to evaluate the performance of the two modalities in a context as similar as possible to the conditions of a real social VR scenario. After watching a given scene in both modalities, participants blindly evaluated them through direct comparisons in terms of relevant social presence aspects like comfort, expressiveness, realism, naturalness, pleasantness, and emotion conveyance. The experimental results showed the superiority of the facial tracking-based approach with respect to lip sync approximation for most of the emotions (precisely, sadness, fear, disgust, and happiness), whereas no significant differences emerged for the remaining two (surprise and anger).

RELATED WORKS
To date, the impact of avatar appearance in multi-user XR scenarios has been investigated in many works in the literature. [14][15][16][17][18][19] In particular, researchers have examined how characteristics of an avatar's reconstruction (among others, aesthetic traits, skills, movements, and behavior) influence relevant aspects related to interpersonal 20 and non-verbal 21 communication, advertising, 22 and so forth. For instance, Roth et al. 23 investigated the effects of the reduction of social information and behavior channels in immersive VEs with full-body, humanoid avatar embodiment. Their results showed that the lack of realism of the avatar hindered and constrained social interactions. The absence of an implementation (and an analysis) of facial expressions and gaze, however, limited the generalizability of the work. The realism of humanoid avatars was also investigated in Dobre et al. 10 In particular, the authors evaluated the effect of both realistic and cartoonish, full-body avatars on the sense of presence in a MR-based telepresence scenario (i.e., an online meeting). The results of the investigation showed that the nonverbal behavior of the realistic avatar was perceived as more appropriate to the interaction and more useful for understanding others than that of the cartoonish one. Based on these studies, it appears that realistic avatars represent the optimal solution from various, fundamental perspectives in the context of social VR (i.e., presence and social interaction). Therefore, it was chosen to consider realistic avatars as the basis for the current investigation.
Once the overall appearance of the avatar was decided, the focus shifted to possible alternatives for body reconstruction. For instance, Yoon et al. 4 analyzed the effect of avatar appearance on social presence with three levels of body part visibility (head & hands, upper-body, and whole-body) in AR. The authors found that the realistic, whole-body avatar was perceived as the best setup for remote collaboration. Similarly, Calandra et al. 8 studied two techniques for avatar representation in multi-user VR scenarios, that is, a head & hands configuration consisting of displaying the 3D models of the VR equipment (HMD and controllers) worn by the user, and a full-body reconstruction obtained by blending Inverse Kinematics (IK) and animations. Voice communication was guaranteed by means of a VOIP channel, although neither technique provided facial animations. The comparison was performed in the context of an emergency training scenario already presented in Reference 24, in terms of embodiment, social presence, and usability. Results showed a preference for full-body reconstruction in critical aspects such as mutual awareness, mutual attention, mutual understanding, immersion, aesthetics, and multi-user interaction. Taking into account these further works, it was determined that full-body avatars could represent a better solution in multi-user experiences compared to representations with lower visibility. For this reason, full-body avatars were considered in this study.
A fundamental feature of realistic humanoid avatars is the possibility to show facial expressions, which can be used as an additional layer of communication on top of voice and body gestures. The inclusion of facial and eye tracking was explored, for example, by Kasapakis and Dzardanova, 25 who investigated the performance of a multi-user VR learning environment populated with a high-fidelity educator's avatar. The avatar featured facial cues as well as eye and body motion recorded in real time (using a VIVE Pro HMD, five VIVE Trackers to track the motion of the pelvis, hands, and feet, ManusVR Gloves, the Pupil Labs Eye Tracking system, and BinaryVR to map facial expressions to blendshapes). According to the reported results, the body and eye movement, together with the facial cues of the educator's avatar, helped 90% of the students maintain their attention during the lecture, thereby increasing their understanding of the concepts presented.
The direct mapping of a user's facial expressions onto an avatar's face indeed has advantages over an expressionless avatar. This mapping, however, typically requires additional hardware or costly VR systems. Hence, more "approximate" software solutions are frequently preferred, which still offer higher functionality levels than a completely static face. One solution that falls into this category is the so-called lip sync approximation, which makes it possible to generate plausible animations of the lower part of the avatar's face based on real-time audio. The use of these techniques as an alternative to audio alone has already been studied. For instance, Hube et al. 12 extended a previous work 26 to examine the influence of facial visual parameters on virtual avatars in VR. Starting from a pre-determined set of audio files, the authors extracted the emotions represented in them and operated a real-time lip sync approximation (through SALSA Lip Sync Suite) with the aim of presenting this information on a virtual avatar in a VE. The experiment consisted of placing the participants in an immersive VE in front of a half-body, realistic-looking avatar, and making them observe the generated avatar behavior while listening to the associated audio. This visualization was then compared with an audio-only variant, in which the avatar remained still. The obtained results indicated the superiority of using additional visual parameters on the avatar's face, as they could help to determine the emotions in the audio clip. This study, however, did not consider the use of body language to enhance the conveyance of a state of mind. In particular, it only evaluated a small number of the possible non-verbal cues, which are important features that users typically rely on to better identify a particular emotion.
By considering the last two studies, it can be observed that available approaches to convey users' emotions through their avatars' facial expressions are characterized by different levels of functionality and deployability.
Summarizing the above review, it appears that full-body, realistic-looking avatars proved to be the most effective representation technique from various points of view and in a wide range of scenarios. Moreover, it also seems that the presence of facial cues, in the form of facial expressions, tended to provide an improved experience compared to expressionless avatars. These indications were used as a starting point for the realization of this work. According to the review, different approaches have been pursued to implement facial expressions. To the best of the authors' knowledge, however, a comparison of two of the most promising approaches, that is, facial and eye tracking, and lip sync approximation, applied to full-body animated avatars in the context of multi-user social VR scenarios has not been performed yet. The current paper addresses this gap, formulating the hypothesis that the employment of a facial tracking technology could, in general, better convey the emotional content of the user's behavior with respect to lip sync approximation, which only recreates the animation of the lower part of the face. The facial capture system is capable of reproducing a broad spectrum of facial expressions, whereas the automated lip-sync tool solely concentrates on synchronizing the lips and does not consider the rest of the face. As a result, the difference between the two modalities may appear obvious if one considers their contribution as isolated from everything else. In real-life social VR scenarios, however, the facial component cannot be separated from voice and body movements. Moreover, in some situations, facial expressions may not be the predominant way to convey emotions. Thus, for some emotions, a less pronounced difference between the two modalities may be expected.

MATERIALS AND METHODS
In this section, the technologies that were used to implement the two configurations considered in the study are discussed, along with the scenario against which they were evaluated.

Configurations and technologies
The VE was developed in Unity 2021.3 as an OpenXR application. The VR kit selected for the experiment was an HTC Vive Pro Eye 1 . The basis for both the considered representations was a realistic-looking, full-body avatar implementation developed in a previous work. 8 In particular, the VRIK module of FinalIK 2 was used to obtain a plausible full-body motion starting from the position and orientation of the HMD and the hand controllers. This module uses IK for the upper body, and a blending of animations to manage walking. The user's voice was captured through the HMD microphone.
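To illustrate the kind of computation an IK-based upper-body solver performs, the following toy sketch solves a two-bone limb (e.g., shoulder-elbow-wrist) in 2D using the law of cosines. This is only an illustration of the general analytic-IK principle, not FinalIK's actual VRIK algorithm; all names are illustrative.

```python
import math

def two_bone_ik(l1, l2, tx, ty):
    """Return (shoulder, elbow) angles in radians so a two-bone chain of
    lengths l1, l2 reaches target (tx, ty); toy 2D model, one bend direction.
    Targets beyond the chain's reach are clamped to the nearest reachable point;
    targets closer than |l1 - l2| are not handled in this sketch."""
    d = min(math.hypot(tx, ty), l1 + l2 - 1e-9)  # clamp unreachable targets
    # Interior elbow angle from the law of cosines (pi = fully straight arm).
    cos_elbow = (l1 * l1 + l2 * l2 - d * d) / (2 * l1 * l2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    # Shoulder angle: direction to target plus inner-triangle correction.
    cos_inner = (l1 * l1 + d * d - l2 * l2) / (2 * l1 * d)
    shoulder = math.atan2(ty, tx) + math.acos(max(-1.0, min(1.0, cos_inner)))
    return shoulder, elbow
```

A full-body solver such as VRIK chains several of these per-limb solves in 3D and blends the result with animation data, but the per-limb geometry follows the same idea.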
For the first modality being studied, in the following referred to as Facial+eye Tracking (FT), the selected HMD (already provided with eye tracking) was equipped with an HTC Vive Facial Tracker 3 (Figure 1a). The reasons behind this choice are manifold. Firstly, the HTC Vive is one of the most appreciated ecosystems among social VR users due to the possibility of employing additional tracking devices to obtain a full-body reconstruction. 27 Secondly, the HTC Vive Pro Eye is already provided with eye tracking capabilities, and can be easily and seamlessly integrated with the HTC Vive Facial Tracker to add the relative functionalities, thus representing a ready-to-use consumer solution for this purpose. This configuration made it possible to perform a real-time mapping between the user's facial expressions and eye movements and a given mesh provided with an eye and facial rig (in terms of blendshapes). In this case, the mesh was the full-body avatar, which was created with Autodesk Character Generator 4 . The character models created with this tool are automatically provided with a full rig (body and face), can be seamlessly integrated with FinalIK, and are almost fully compatible with the Vive Tracker facial rig (except for the "frown" blendshape, which had to be manually added in Autodesk Maya by merging other ones).
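The mapping step described above can be sketched as a remapping of tracker blendshape weights onto the avatar rig's own blendshape names, including a composite "frown" value merged from other shapes in the spirit of the Maya fix mentioned. All blendshape names, weights, and the weight ranges below are hypothetical, not the actual SRanipal/Unity API.

```python
# Hypothetical mapping from tracker blendshape names to avatar rig names.
TRACKER_TO_AVATAR = {
    "Jaw_Open": "jawOpen",
    "Mouth_Smile_Left": "smileL",
    "Mouth_Smile_Right": "smileR",
}

# Shapes whose weighted sum approximates the missing "frown" blendshape.
FROWN_SOURCES = {"Mouth_Sad_Left": 0.5, "Mouth_Sad_Right": 0.5}

def remap_weights(tracker_weights):
    """Translate tracker weights (0..1) into avatar rig weights (0..100)."""
    avatar = {}
    for src, dst in TRACKER_TO_AVATAR.items():
        avatar[dst] = 100.0 * tracker_weights.get(src, 0.0)
    # Synthesize the composite "frown" shape from its source shapes.
    frown = sum(w * tracker_weights.get(s, 0.0) for s, w in FROWN_SOURCES.items())
    avatar["frown"] = 100.0 * min(1.0, frown)
    return avatar
```

In a Unity implementation, the resulting per-shape weights would be applied each frame to the skinned mesh (e.g., via `SkinnedMeshRenderer.SetBlendShapeWeight`).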
The second modality, later referred to as Lip-Sync approximation (LS), did not rely on additional hardware modules, as it used real-time algorithms to generate a plausible facial motion starting from an audio clip. To this purpose, similarly to what was done in Reference 12, the SALSA LipSync Suite 5 asset was employed. Along with the main asset, the One-Click automatic configurator for Adobe Character Generator was used to guarantee a smooth integration with the considered humanoid mesh, along with the Eyes module to generate a plausible random eye motion. SALSA was chosen as the lip sync approximation because it is a very common software solution for providing real-time, plausible facial animation to talking avatars in Unity-based VR experiences. In fact, it represents a good trade-off between the cost for the end-user (no hardware required) and animation quality (e.g., the upper face can be either animated randomly, or kept non-animated). Moreover, as seen in the above review, it has also been the subject of investigations in previous works on the topic.
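The general idea behind audio-driven lip sync approximation can be sketched as follows: the energy of the latest audio frame selects one of a small set of mouth shapes (visemes), which the engine then blends over time. This is a deliberately minimal illustration of the principle, not SALSA's actual algorithm; the viseme names and thresholds are invented.

```python
import math

VISEMES = ["closed", "small", "medium", "open"]  # illustrative shape set
THRESHOLDS = [0.02, 0.1, 0.3]  # hypothetical RMS cut-offs between shapes

def select_viseme(samples):
    """Pick a mouth shape from the RMS energy of one audio frame (-1..1)."""
    if not samples:
        return "closed"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Return the first shape whose upper threshold exceeds the frame energy.
    for viseme, threshold in zip(VISEMES, THRESHOLDS):
        if rms < threshold:
            return viseme
    return VISEMES[-1]
```

An amplitude-only scheme like this explains a limitation reported later in the paper: non-speech sounds that carry little energy (or that the detector filters out) produce no mouth movement.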

Scenario
In order to better evaluate the contribution of the studied modalities in terms of emotion conveyance, it was decided to perform an analysis per single emotion, similarly to what was done in Reference 12. The authors of that work focused on three main emotions (happiness, sadness, and anger). In the present study, it was decided to widen this set to cover all six of Ekman's basic emotions 13 (anger, happiness, sadness, fear, disgust, and surprise). For each emotion, the participants were shown an actor performing a short script. Each script was written with the aim of maintaining the feeling and the overall perception within the field of the particular emotion represented, based on the scales of positivity and value that are generally used to define perceived emotions in studies on the subject. 28 Furthermore, an attempt was made to make the scripts as "social" as possible, and not to make them longer than a minute each. Figure 1b shows some excerpts from the scenes that the participants watched and rated. The original full scripts in Italian, their English translations, and the recordings of the six scenes are available for download 6 . In order to minimize the possible sources of bias associated with the need to watch the same scenes multiple times, it was decided not to rely on a live performance of the actor in a real multi-user scenario, but rather to opt for a simulated one. Thus, the actor's performance was pre-recorded; an avatar driven by these recordings was then presented to the participants, effectively uniforming the experience. A single actor was involved in the acting of all the scenes. The actor was a 26-year-old man practicing acting as an amateur. The scenes were replayed and recorded several times. Finally, the best six recordings were selected, one per emotion. It is important to note that voice intonation can also influence emotion conveyance. [29][30][31] For this reason, the actor was requested to adopt the intonation that would best match the emotion to be conveyed in the scene. Then, in selecting the recordings, the appropriateness of the intonation was also taken into account.
Recorded inputs included voice (microphone), movements (HMD and controllers), and facial expressions (blendshapes), that is, the data that would be transmitted over the network to drive the representation of each user's avatar in a real multi-user scenario. For both evaluated modalities, FinalIK drove the body motion based on the head and hand movements. For the FT modality, all the recorded data were synchronized and reproduced at the same time, obtaining a result similar to having an actor playing in real-time in a real multi-user scenario. The same recordings were used to drive the representation for the LS modality, simply discarding the data related to the facial blendshapes and eye motion, keeping the same body and voice input, and using SALSA to approximate the lip sync. This choice made it possible to keep the body movements and voice intonation perfectly identical in the two conditions investigated. Given the focus on facial expressions, this was essential to remove possible sources of bias.
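The record-and-replay scheme above can be sketched as a list of timestamped frames holding head/hand poses and blendshape weights, where the LS condition is obtained by simply dropping the facial channel before playback. Structures and field names are illustrative, not the paper's actual data format.

```python
from bisect import bisect_right

def make_frame(t, head_pose, hand_poses, blendshapes):
    """One captured frame: timestamp, head pose, hand poses, face weights."""
    return {"t": t, "head": head_pose, "hands": hand_poses,
            "blendshapes": blendshapes}

def strip_face_data(recording):
    """Produce the LS variant: identical body/voice timing, no facial channel."""
    return [dict(f, blendshapes=None) for f in recording]

def frame_at(recording, t):
    """Return the latest frame at or before time t (simple playback lookup)."""
    times = [f["t"] for f in recording]
    i = bisect_right(times, t) - 1
    return recording[max(i, 0)]
```

Because both conditions replay the very same frames, body motion and audio remain identical by construction, which is the bias-removal property the paragraph describes.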

Experimental procedure
The experimental activity was designed as a within-subjects user study, with a sample of 28 participants (16 males, 12 females) aged between 21 and 68 years, recruited among the students and staff at the authors' university. The hardware used for the experiment was the same used for the recording of the six scenes. Initially, the participants were requested to fill in a brief demographic questionnaire regarding age, gender, general experience with VR, and experience with multi-user social VR/metaverse applications. A large part of the sample answered that they had limited experience with VR technology, whereas the vast majority indicated they did not have previous experience with social VR or metaverse(s). In particular, ∼61% of the participants rated their experience with VR technology with a score equal to or below 2 on a scale from 1 to 5, where 1 indicated almost no experience and 5 indicated daily use. Regarding the use of social VEs and metaverse(s), the average scores were even lower, with ∼78% of the participants stating they had almost never experienced this kind of application. The participants were told that the study focused on the expressiveness and communication ability of the avatar, in particular of the upper body, where the three main communication factors that intervene in any conversation in the real world (tone of the voice, arm and hand gestures, and facial expressions) are more evident.
After this introduction, the participants were provided with the HMD and invited to enter a VE depicting a theater 7 , where they could see the actor's recorded avatar. In each scene, the participants were free to move on the stage via natural walking around the avatar to observe gestures and expressions from their preferred position (Figure 2). This freedom made it possible to recreate conditions similar to those users would find in a social VR environment. A Latin square order was used to balance the exposition to the six scenes (emotions). Each scene was played one time per modality, in a randomized order. Thus, the participants were not aware of which modality they were actually experiencing (modality 1 or modality 2, based on the order of exposition). After having watched a pair of scenes referring to the same emotion, the participants were asked to answer a comparative questionnaire. Previous works in this field did not provide a standard methodology suitable for this kind of investigation. Therefore, it was necessary to create an ad-hoc set of questions aimed at measuring relevant aspects, such as comfort with the representation, expressiveness, realism, naturalness, likelihood, pleasantness, emotional conveyance, and overall preference. This process was repeated for the six emotions. In order to limit the possible ambiguity of questions related to comfort and pleasantness when applied to negative emotions (i.e., disgust, anger, and sadness), the participants were told to approach the scenes with a neutral and detached point of view with respect to the emotion expressed. Finally, the participants were asked to provide comments in the form of open feedback. The full questionnaire is available for download 8 . 6 Scene scripts and recordings: http://bit.ly/3xNnH52. 7 Madame Walker theatre: https://sketchfab.com/3d-models/madame-walker-theatre-98ba4154bbb644bb9cb4d9c68d7dd87b.
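A Latin square order such as the one used to balance scene exposition can be generated as follows. The paper does not detail the exact counterbalancing scheme, so this sketch shows a standard balanced Latin square construction for an even number of conditions (here, the six emotions), in which every condition appears once per row and per column.

```python
def balanced_latin_square(n):
    """Balanced Latin square for an even number of conditions n.

    Row 0 follows the classic 0, 1, n-1, 2, n-2, ... sequence; each
    subsequent row shifts it by one, so every condition appears once per
    row and per column, and first-order carryover effects are balanced.
    """
    assert n % 2 == 0, "the balanced construction requires an even n"
    first = [0]
    low, high = 1, n - 1
    for i in range(1, n):
        first.append(high if i % 2 == 0 else low)
        if i % 2 == 0:
            high -= 1
        else:
            low += 1
    return [[(x + r) % n for x in first] for r in range(n)]
```

With n = 6, six participant orderings cover each emotion in every serial position; assigning participants cyclically to rows balances position effects across the sample.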

RESULTS
In this section, the experimental results (summarized in Figure 3) are presented and discussed. The Shapiro-Wilk test was used to evaluate the normality of data. Since data resulted as non-normally distributed, the non-parametric Wilcoxon signed-rank test was performed, with a 5% significance threshold (p-value < .05). After running the experiments, a power analysis was performed, and a power of ∼0.99 with a minimum effect size equal to ∼0.85 was obtained. This outcome confirmed the significance of results, which in the following are discussed on a per-scene basis.
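As an illustration of the non-parametric test used here, the following pure-Python sketch computes the Wilcoxon signed-rank statistic W for paired FT/LS ratings (in practice, a library routine such as `scipy.stats.wilcoxon` would be used, which also returns the p-value; the procedure below only shows the statistic itself).

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank statistic W for paired samples.

    Zero differences are dropped, absolute differences are ranked with
    average ranks for ties, and W is the smaller of the positive-rank
    and negative-rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(diffs, key=abs)
    # Assign average ranks to tied absolute differences.
    ranks = {}
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks.setdefault(abs(ranked[k]), avg)
        i = j
    w_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    w_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return min(w_plus, w_minus)
```

A small W means the differences are mostly in one direction (e.g., FT consistently rated above LS); the p-value is then obtained from the exact null distribution of W or, for larger samples, a normal approximation.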
In particular, the results (in terms of average scores ± standard deviations) for the "Sadness" scene are provided in Figure 3a. In this case, the FT modality was judged as significantly superior to the LS modality across all the eight dimensions investigated through the questions asked to the participants. In particular, FT was more convincing than LS with regard to the general expressiveness of the avatar (p = .014). This outcome is in line with the expectations, since the emotion of sadness is mostly conveyed by eye and mouth movements 13 and, based on observations made in the development steps, the output of the FT modality was found to be more accurate than the approximation provided by the LS.
Moving to the "Disgust" scene (Figure 3b), the results again highlighted a significant preference of the participants in favour of the FT modality over the LS one, albeit not across all the dimensions as for the previous scene. For this emotion, the participants found FT more realistic than LS, among other dimensions; for the remaining dimensions, instead, results did not significantly differ. This outcome may be related to the fact that some emotions (in particular, disgust and surprise) are often expressed impulsively and, thus, on a shorter time frame compared to others that may be more persistent over time (anger, sadness, and happiness). Thus, it may be harder to notice differences between the two modalities, especially if the participant happens not to be particularly focused on the other peer's face at the very moment in which the emotion is impulsively expressed. This aspect was also reported by some participants in the open feedback section.
As for the "Fear" scene (Figure 3c), the FT modality was judged as superior to the LS one for four dimensions. In particular, the participants found FT more natural (3.60 vs. 2.40, p = .01), more plausible (3.53 vs. 2.47, p = .02), more capable of conveying the emotional content of the scene (3.57 vs. 2.43, p = .04) and preferred it, overall (3.60 vs. 2.40, p = .02). For the other dimensions, no significant differences were found. This outcome may be explained by the important role of the upper face in expressing this emotion, for example, in terms of cheekbone and eyebrow motion, which is not managed by LS. Furthermore, based on the open feedback, some participants noticed that LS did not generate any mouth movement when the actor was gasping, which is a common action in this particular scene. This behavior is likely due to a limitation of the software, which does not trigger animations for sounds other than speech.
Regarding the "Happiness" scene (Figure 3d), again no clear winner was observed. FT was perceived as significantly better than LS in terms of realism (3.46 vs. 2.54, p = .03), general likelihood (3.46 vs. 2.54, p = .04), pleasantness (3.64 vs. 2.36, p = .016) and general preference (3.50 vs. 2.50, p = .05). It should be noted that, as concerns expressiveness, average scores for the two modalities were extremely similar. This result may be related to the fact that the HTC Vive Facial Tracker showed some issues in detecting the user's smile or in properly triggering the blendshape associated with the smiling action. This explanation is in line with some of the comments by the participants, who stated that they had not seen the avatar smiling despite the context, and therefore both the modalities appeared as not particularly expressive. As concerns emotion conveyance, results were close to the significance threshold (3.39 vs. 2.60, p = .073), suggesting that with a larger sample size the difference could possibly become significant.
No significant differences were found for the "Anger" scene (Figure 3e), probably again due to the impulsive nature of the given emotion. Moreover, some participants commented in the open feedback section that the scene heavily relied on speech, for which LS is probably on par with FT.
Similarly, also for the "Surprise" scene (Figure 3f) no significant differences were found. This outcome may be related to the fact that surprise is a very ephemeral emotion, very expressive for short periods but difficult to maintain over time, unlike more persistent emotions such as anger or sadness. Thus, it may be possible that the presence of other elements (such as body movements) and the audio clip itself already provided enough cues that other differences in the face could not be spotted, despite the theoretically higher fidelity of FT with respect to LS for facial expressions not related to speech.

CONCLUSION AND FUTURE WORKS
In this work, two avatar representations were compared by means of a user study in terms of emotion conveyance in the context of a social VR/metaverse-oriented scenario. Both modalities rely on a realistic full-body avatar representation obtained by blending the contribution of IK and animations. The first modality, labeled Facial+eye Tracking (FT), achieves the described representation by employing additional hardware elements (HTC Vive Pro Eye + Facial Tracker) to track the user's eyes and facial expressions. The second modality, labeled Lip-Sync approximation (LS), takes advantage of a commercial software solution (SALSA Lip Sync) to drive a visually-plausible approximation of the facial movements related to speech based on the audio captured by the HMD microphone.
The study involved 28 participants who blindly evaluated the two configurations by separately watching an avatar performing six pre-determined scenes related to the six Ekman's basic emotions. 13 The evaluation covered important social presence aspects such as comfort, expressiveness, realism, naturalness of the expressions, likelihood, pleasantness, emotion conveyance, and overall preference. The experimental results showed a clear superiority of FT with respect to LS concerning the conveyance of sadness and disgust for most of the analyzed dimensions, highlighting the importance of having a faithful representation of the user's eyes and facial cues in the selected use case. The same trend was observed for happiness and fear, but to a lesser extent, probably because of the mentioned hardware issue with the detection of the smiling action for the former, and of the more ephemeral and impulsive nature of the emotion for the latter. Finally, no significant differences were observed for the remaining emotions. For anger, this outcome may be related to the fact that the scene depicting it was mostly based on speech, for which LS (via SALSA Lip-Sync) showed solid performance. Regarding surprise, the ephemerality of the emotion may again have played a role in making it not so noticeable with both FT and LS. Future developments will be devoted to extending the analysis reported in this work, for instance by considering in the evaluation further approaches and technologies that can be used to convey emotions in social VR scenarios, either hardware-based (e.g., the facial tracking of the Meta Quest Pro) or software-based (e.g., emotion detection algorithms, in combination with SALSA to better approximate the whole face). The scenes representing the emotions may be refined too, in order to address the limitations related to the reduced "screen time" of some emotions (e.g., disgust and surprise) with respect to other, more persistent emotions.
Similarly, the anger scene may be modified to be less speech-based, and to include more anger-related facial expressions (e.g., shouts). The incorporation of multiple scenes to represent each emotion could also be considered. In fact, since emotions may be expressed in different ways depending on the situation, the use of different scripts for a given emotion would make it possible to achieve more general results. Furthermore, other factors influencing non-verbal communication may be included in the evaluation, like body gestures and voice intonation. To this purpose, full-body tracking solutions may be considered as an alternative to IK. Finally, further avatar representations may be considered (e.g., cartoonish, non-human, abstract), possibly integrated with other techniques for conveying emotions (e.g., with faceless avatars, additional user interface elements to communicate the user's emotional status), to study their suitability for social VR scenarios and metaverse(s).

ACKNOWLEDGMENTS
This work has been carried out in the frame of the VR@POLITO initiative. The authors wish to acknowledge the help of Gianmario Lupini, who contributed to the development of the system and to the experimental activity.

Fabrizio Lamberti received the M.Sc. and the Ph.D. degrees in computer engineering from Politecnico di Torino, Italy, in 2000 and 2005, respectively. Currently, he is a Full Professor at the Dipartimento di Automatica e Informatica at Politecnico di Torino, Italy. His interests are in the areas of computer graphics, human-machine interaction, and intelligent systems. He is a senior member of the IEEE and the IEEE Consumer Technology Society, for which he is currently serving as VP Technical Activities (BoG Member-at-Large 2021-2023).

SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.