Let’s Get in Touch! Adding Haptics to Social VR

Social VR should allow natural communication between users with high social presence, as if users were in the same room. One way to increase social presence is to add haptic interaction to allow, for example, users to give each other a "high-five" or to pass documents among them. In this paper, we present our web-based VR communication framework with an added haptic component to simulate touch. The goal of this framework is to enhance the VR communication experience and the exchange of social cues between users in VR. We describe our method for rendering haptic feedback within the web-based framework and evaluate the perceived quality of our system with a user survey (with 119 participants). Our proof-of-concept system was rated positively, with the haptic component offering an enhanced quality of the VR experience for 78% of the participants.


INTRODUCTION
Virtual Reality (VR) technology has been growing in popularity for a wide variety of applications such as immersive video games, robot teleoperation, training applications, or as a social communication platform. Using VR for social communication enables shared experiences between VR users at remote locations. Research has shown that social context cues, understood as nonverbal cues such as body posture, eye-gaze, gestures or touch, have positive correlations with perceived communication quality and satisfaction [25] and support communication [1]. Yet, current technologies for remote communication, such as videoconferencing, have clear limitations for conveying social and shared aspects. Although audio and video media may be improved, these modalities only seem to convey a restricted flow of social information between users [32].
The added value of VR as a social communication medium lies in its affordance of social cue transfer. VR offers users the opportunity to feel immersed in the scene [4], by rendering multimodal media in a higher dimensionality than that of 2D experiences. Another advantage of VR is the possibility of integrating further modalities, other than 3D audio and video, for conveying subtle social context cues. One example is simulated touch, more commonly referred to as haptic feedback [14].
In this paper we present our ongoing efforts toward adding tactile modality to a social and shared VR experience, enabling users to communicate using nonverbal social context cues, and perform a first user test in our VR framework, for assessing the perceived system quality.

RELATED WORK
Some social VR services already allow communication and collaboration between remote users, as is the case of Facebook Spaces [21], Microsoft SharePoint [18], VRChat [16], or SteamVR [7]. However, these services represent users as avatars, which may be too restrictive for conveying nonverbal social cues and cause a feeling of disconnection from the VR experience [19].
In previous work [20], VR participant representation using photorealistic video quality was investigated. By representing users with video data from the capture and blending them into the VR space, users reported a high degree of satisfaction, immersion and presence in the interaction. Presence is understood as the feeling of "being there", and, contrary to videoconferencing, VR experiences have been reported to elicit high levels of it [4]. This might influence the amount of social cues the users exhibit in a VR setting, as they experience it as more "natural", and also translate into feelings of social presence, often described as the "sense of being together with another" [2]. Social presence can be a vehicle for evaluating the social cue exchange between VR users.

Haptics Technology
Haptics technology simulates the sense of touch in computing [14]. It covers both force feedback (simulating object hardness and inertia) and tactile feedback (simulating the contact surface and its properties, such as temperature or smoothness) [3]. Haptics requires portable, special-purpose hardware, commonly referred to as haptic interfaces. Generally, these interfaces have input transducers (sensors that measure positions and contact forces) and output transducers (actuators that display the contact forces and positions to the user in appropriate spatial and temporal synchronization). In [6], a recent overview of commercially available haptic devices is provided, particularly haptic feedback gloves suitable for VR applications. The interfaces are analysed according to force and tactile feedback actuators, degree of freedom (DoF) tracking per hand or finger, and actuation and sensing principles. Some examples include the Plexus glove [5], Avatar VR [26], Senso Glove [15] and VRgluv [31].
Research around haptics is emerging and concerns the development of algorithms and software to generate and render the "touch" of virtual environments and objects [23]. Haptic rendering refers to the algorithms that detect and report when and where contact has occurred (collision detection) and compute in real time the interaction feedback force for the haptic device [3]. Some known issues in this area include restricted bandwidth and low latency requirements.
The haptic modality has been shown to increase the perceived social presence and enhance shared task collaboration in virtual environments and non-immersive systems [24], [9]. Including this modality in social VR environments might also positively affect the user's perception of social presence.

SOCIAL VIRTUAL REALITY FRAMEWORK
Our goal is to create a shared, multimodal VR communication system, where participants can feel present and interact with other users in remote locations. The main contribution of this framework resides in its increased multimodality, achieved by including haptics for simulating touch in VR, along with 3D audio and video for a photo-realistic representation of participants.
The VR system architecture design is based on previous work [11], [20], and relies on off-the-shelf hardware and currently available web technology, for rapidly testing and evaluating VR experiences. Given the currently available haptic devices suitable for VR, we opted for a vibro-tactile haptic glove (Manus/Elitac) for rendering the tactile modality. Our social VR framework can be divided into four main steps: capture, process, stream and render, described in further detail in the following subsections and illustrated in Figure 1.
The setup for this framework, illustrated in Figure 2, includes two Intel® RealSense™ D415 RGB and depth (RGB-D) capturing sensors [17], connected to a laptop computer. The laptop computer runs Windows 10, with an NVIDIA GeForce GTX 980 Ti graphics card, an Intel® Core™ i7-6700K processor and 16 GB RAM, and is responsible for processing and transmitting the captured data. Each participant views the other participants in VR with a Samsung Odyssey Windows Mixed Reality Headset [28], which includes a microphone and is also connected to the laptop computer. The audio is played out through Sony MDR-1000X headphones [29]. The tactile modality is rendered through the vibro-tactile haptic glove Manus Prime One [30] with a custom Elitac extension [27], adding a total of 16 vibro-tactile feedback points to the user's hand. Each participant wears a glove on his/her right hand, and the glove is connected to the laptop computer over both Bluetooth (Manus glove) and a USB cable (Elitac extension). Note that the setup described here is required for each participant in the VR interaction.
In this framework, we utilize A-Frame and WebVR to create a 3D VR environment for the interaction. The VR environment is based on the Great Drawing Room 3D model [13] with an additional virtual table rendered in the middle of the room. The scene includes two pre-recorded participants, rendered sitting at the virtual table, and allows two participants to join the interaction by sitting together at the table (see Figure 3). When the users perform a high five in this VR world, they perceive the haptic feedback. The steps to achieve this are further described in the paragraphs below.

Capture
In order to obtain a 3D representation of the participant, the capture step is performed with the two RGB-D sensors aimed at the participant from the front, from two different angles. The participant is located at a short distance from the sensors, illustrated in Figure 4. Each sensor capture results in a partial 3D representation (RGB-D image pairs) of the participant.

Process
The processing step comprises the one-time calibration phase, the haptic device ArUco marker tracking and a conversion of the depth image into gray-scale.
The calibration phase concerns the alignment of the two RGB-D sensors used to capture the participant. The registration and alignment of the two sensors is done with the help of a large ArUco marker (30x30 cm) and pose matching. This results in a near 180° 3D representation of the user (front view), from the RGB-D frame pairs. The calibration parameters from the rigid body transformation are sent as metadata together with the RGB-D visual data.
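The pose-matching idea behind such a calibration can be illustrated with a minimal sketch. It assumes that each sensor reports the pose of the shared marker as a 4x4 homogeneous matrix mapping marker coordinates into that camera's coordinates; this representation and all function names are illustrative assumptions, not the implementation used in the framework.

```python
# Sketch: derive the rigid transform between two cameras from one marker
# that both cameras observe. Poses are 4x4 nested lists (an assumption).

def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def camera_b_to_a(marker_in_a, marker_in_b):
    """Transform that takes points from camera B's frame into camera A's."""
    return mat_mul(marker_in_a, invert_rigid(marker_in_b))

def apply_transform(t, point):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))
```

Under these assumptions, the matrix returned by `camera_b_to_a` plays the role of the calibration parameters: applying it to every point captured by the second sensor expresses both partial captures in one coordinate frame.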
The haptic device tracking is performed through another ArUco marker according to [8] and [22], with the same sensors used for the participant capture. The marker is placed in the palm of the haptic glove (Figure 1), for achieving accurate and time-synchronized tracking of the hand. The 3D pose and position coordinates of the hand are computed relative to the RGB-D capture and for each captured frame.
For transmitting the RGB-D frame data to the VR clients over WebRTC, we convert the depth data into an 8-bit gray-scale image to comply with current video encoders. For this conversion we use an improved version of [12]: we map the depth range of 0 to 1.5 m onto 765 gray-colour values. The gray-scale depth image is concatenated to the RGB image so that both are streamed as a single RGB-D video stream. In the VR environment the depth image is converted back.
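A depth-to-colour mapping of this kind might look like the sketch below. The depth range and the 765 levels come from the text; how those levels are laid out in the image is not specified here, so the sketch assumes they are spread over three 8-bit channels (3 x 255 = 765) that fill up one after another. The function names are hypothetical.

```python
# Sketch: quantize depth in metres to 765 levels and back.
MAX_DEPTH_M = 1.5   # depth range stated in the text
LEVELS = 765        # 3 channels x 255 steps (channel layout is an assumption)

def encode_depth(depth_m):
    """Map a depth in metres to three 8-bit channel values."""
    clamped = min(max(depth_m, 0.0), MAX_DEPTH_M)
    level = round(clamped / MAX_DEPTH_M * LEVELS)
    r = min(level, 255)                    # first 255 levels
    g = min(max(level - 255, 0), 255)      # next 255 levels
    b = min(max(level - 510, 0), 255)      # last 255 levels
    return r, g, b

def decode_depth(rgb):
    """Recover an approximate depth in metres from the channel values."""
    return sum(rgb) / LEVELS * MAX_DEPTH_M
```

With this layout one quantization step corresponds to 1.5 m / 765, or roughly 2 mm, and values beyond the range are clamped to the nearest end of it.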

Stream
The streaming step was previously reported in [10]. It relies on a web framework with a peer-to-peer nature for delivering social VR experiences to each of the participants. This web streaming framework employs WebRTC for browser-based real-time communication.
All 2D video streams and users' audio are transmitted via WebRTC. Any associated metadata, as well as the haptic glove tracking results, are transmitted via a central media orchestration server (also explained in [10]) over Socket.IO connections. For the hand tracking we only transmit a center point reflecting the center of the hand.

Render
Our client software is based on A-Frame and WebVR and thus renders on any OpenVR capable VR HMD. We display a self-view to each user, because a feeling of disconnection from the immersive experience arises if a person does not see him- or herself in VR. In previous work, we performed an experiment on self-view while giving a "high-five" in VR, to determine the optimal resolution and representation for a believable self-view. Based on the results, the participant self-view is rendered as a point cloud (3D volumetric data) in our VR framework (see Figure 5). The point cloud is created by first converting the gray-scale depth data to a depth value and then recalculating the 3D position of each point based on the calibration data of each RGB-D sensor. This is done in a unique GLSL WebGL shader and thus runs efficiently on the GPU.
Similar to the self-view, we display remote users based on the video from the WebRTC connection. The same shader is used for the gray-scale depth to 3D conversion, but the end result is displayed not as a point cloud but as a mesh with the help of a PlaneBufferGeometry.
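The per-point computation performed in the shader can be sketched on the CPU as follows. This is a minimal illustration assuming a standard pinhole camera model; the intrinsic parameters (`fx`, `fy`, `cx`, `cy`), the 4x4 extrinsic matrix and the function names are assumptions introduced for the example, not the shader code of the framework.

```python
# Sketch: turn one depth pixel into a 3D point in a shared frame.

def depth_pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Deproject pixel (u, v) with depth in metres using a pinhole model."""
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return (x, y, depth_m)

def to_shared_frame(point, extrinsic):
    """Apply a 4x4 rigid calibration matrix (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3]
                 for row in extrinsic[:3])

# Hypothetical intrinsics and a calibration that shifts the sensor 0.5 m in x.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
SHIFT_X = [[1, 0, 0, 0.5], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1]]

center = depth_pixel_to_point(320, 240, 1.0, FX, FY, CX, CY)
shared = to_shared_frame(center, SHIFT_X)
```

Running the same two steps for every pixel of both sensors yields points in one coordinate frame, which can then be drawn as a point cloud or used to displace the vertices of a mesh.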
For the haptic glove rendering, we use a collision detection library that is part of the A-Frame-physics-system web component. We set a translucent sphere around the center point of each captured hand (own and remote). Once the spheres trigger a collision, a collision event is sent to a local haptic-feedback module (running on each local computer) via Socket.IO. The haptic-feedback module

SOCIAL VR COMMUNICATION USER SURVEY
As a first test of our social VR framework, we performed a user experiment followed by a questionnaire. The experiment aims to investigate the overall perceived quality of our social VR framework and setup, along with each of its components, and to collect information on what participants prefer in a social VR communication platform.

Procedure
The participants were given the opportunity to try the social VR framework in an unstructured way, for communicating with other participants. They tried the system on their own (in pairs) and some were accompanied by the facilitator joining in the room.

Participants
The experiment was performed with N = 119 participants from a convenience sample of visitors to our booth at VRDays Europe 2019, 13-15 November 2019, Amsterdam¹. The participant sample consisted of 85 males and 34 females, with mean age M = 37.4 years and standard deviation SD = 10.2 years. Most of the participants (n = 100) had extensive VR experience (more than 5 times), some (n = 15) had only experienced VR a few times (less than 5), while a few (n = 4) had no experience.

Measures
The participants were asked to rate the social VR experience across five aspects: overall quality of the experience, video quality, audio quality, naturalness and restrictiveness. For this, a Likert scale ranging from 0 to 10 was used, where five is considered neutral, values below five correspond to negative ratings and values larger than five correspond to positive ratings. Furthermore, concerning the newly added tactile modality, participants rated the importance of being able to touch in VR with three choices (not important, somewhat important, very important) and were asked to make a judgement regarding the improvement achieved by adding tactile modality to the VR experience (experience enhanced, experience not enhanced). Finally, the participants answered open questions regarding the preferred and least preferred aspects of the system.
¹ https://vrdays.co/schedule-2019/

Results
The participants' ratings for the social VR experience across the five aspects yielded the results in Table 1.
All aspects about the quality of experience were rated positively (above neutral), except for video quality which was rated negatively (below neutral). The highest positive score was achieved by the overall quality of the social VR experience. The audio quality aspect could not be reliably assessed by all the participants, because it was absent in some demonstrations.
A one-sample t-test indicated a significantly positive score for the overall quality (M = 6.0, SD = 1.9) of the VR experience in comparison to the neutral score, t(118) = 5.838, p < .001. The restrictiveness aspect (M = 5.5, SD = 2.3) had a borderline significant positive result (not restricted), t(118) = 2.412, p = .017. On the other hand, the video quality (M = 4.3, SD = 1.9) was rated significantly negative in comparison to the neutral value t(118) = −3.834, p < .001, and no significant results were found for the naturalness aspect.
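For readers who want to retrace these statistics, the one-sample t statistic can be recomputed from the reported summary values, as in the sketch below. The function name is illustrative, and because M and SD are rounded to one decimal in the text, the recomputed value only approximates the reported one (roughly 5.74 versus 5.838 for the overall quality).

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """t statistic for a sample mean against a reference value mu0.

    The associated degrees of freedom are n - 1.
    """
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error

# Overall quality, using the rounded summary values from the text
# and the neutral scale point 5 as the reference value.
t_overall = one_sample_t(6.0, 1.9, 119, 5.0)
```

The same function applied to the video quality values gives a negative statistic, matching the direction of the reported result.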
These results are also reflected in the answers to the open questions regarding the preferred and least preferred aspects of the VR experience. The answers conveyed judgements that could be grouped into categories of perceived quality, functionality and comfort for each of the system's components (audio, video, haptics, tracking and overall). Each of these components was judged in every category as most liked, least liked, restrictive or in need of improvement. Table 2 shows the number of open answers per judgement, component and category. The video quality was reported as the least-liked component, followed by the quality of the haptic component. However, when analyzing the functionality answers, the haptic component was judged as the most-liked functionality, and the functionality of the video component was judged to be the most in need of improvement. The overall system functionality and comfort were considered to be restrictive, due to the body-worn equipment and the wires, which restricted the participants' movements.
Focusing on the questionnaire items concerning the tactile modality, 78% of the participants responded that touch enhanced the quality of the VR experience, while 22% responded that touch did not enhance the experience. As for the importance of touching someone in VR, 42% of the participants found this very important, 47% rated this as somewhat important, while only 11% did not find it important. These findings agree with the feedback from the open questions that the haptic component was the most liked functionality of the system.

DISCUSSION AND FUTURE WORK
In this paper, we present the groundwork for adding tactile modality to our web-based VR communication framework. Our main contribution lies in realizing a shared VR communication system, which includes audio, video and tactile modalities for real-time VR communications, and through this, enrich the social cue exchange between VR participants. We described our approach and reported a first user experience test, to gain insight on the quality, effectiveness and receptivity of our system.
The results of our social VR framework user test confirm that our system still has several technical limitations (such as the video quality) that must be addressed before it can be fully used in everyday remote communication. However, the overall positive ratings also make clear that users appreciate our system and virtual communication. The addition of the haptic interface for simulating touch in VR communication was received positively by the users and was the most liked functionality of the system. These results are consistent with other literature [24], [9] reporting that haptic simulation of touch can add to the social presence of remote VR communication.
Suggestions for improvement of our social VR framework focused first and foremost on the visual quality. The visual modality should be enhanced by increasing the resolution and update rate of the VR scene, and the fragmented self-view and participant representation should be restored (holes and missing lower body parts of the other participants). Regarding the haptic component, the hand tracking latency should be addressed and touch representation should rely on pressure instead of vibration. Furthermore, the framework should make it possible to touch not only hands, but also other body parts and VR objects present in the scene. Regarding the overall setup, making the system wireless should be considered in order to afford free movement.
We are confident that the tactile modality is a positive addition to the user experience and can be a vehicle to create social presence in VR communications. Nonetheless, further research should be carried out in order to assess the immersion and the feeling of social presence that our social VR communication framework can elicit, after improving it with the feedback obtained.