A method for creating interactive, user-resembling avatars

ABSTRACT
Virtual reality (VR) applications have disseminated throughout several fields, with a special quest for immersion. The avatar is one of the key constituents of immersive applications, and avatar resemblance can provoke diverse emotional responses from the user. Yet, many virtual reality systems struggle to implement realistic, life-like avatars. In this work, we propose a novel method for creating interactive, user-resembling avatars using available commercial hardware and software. Avatar visualization is possible with a point cloud or a contiguous polygon surface, and avatar interactions with the virtual scenario happen through a body joint approximation for contact. In addition, the implementation can be easily extended to other systems, and its modular architecture admits improvement both on visualization and on physical interactions. The code is under the Apache License 2.0 and is freely available as supplemental material.


INTRODUCTION
Virtual reality (VR) is understood as an advanced human-computer user interface that mimics a realistic environment by linking human perception systems with a virtual environment (Zheng et al., 1998). The key point about VR is its attempt to provide seamless interaction with a computational environment, thus understanding human intent and creating a reasonable response from the virtual environment.

The gaming industry has historically driven developments in VR (Mills, 1996), contributing to lowering the cost of the technology. There has been a growing demand for techniques or devices that improve the user's ability to feel inside a game, surrounded by elements that promote the desired gameplay. To achieve that, the industry has turned to motion tracking devices and immersive 3D graphics. Ideally, such a system would be able to track body movements, record user images and present them inside the head-mounted display, reacting to what the user is looking at and how the user is moving. A key issue is that there is scarce information on how to integrate and utilize both technologies in one whole system. An example is shown by Woodard and Sukittanon (2015), explaining how both the Oculus Rift and the Kinect could be used to create virtual interactive walkthroughs for building projects. However, they do not use the Kinect's full capabilities, such as body-frame detection and RGBA frames, to recreate the avatar and interact with the virtual environment.

In accordance with growing evidence of the importance of avatar representation and interaction for an improved immersion experience in virtual reality, this work presents a software approach to integrate a motion tracking device with depth sensing capabilities (Kinect V2) with a head-mounted display (Oculus Rift DK2) in order to improve interaction and the perception of immersion in VR systems. This software uses the SDKs of both devices, along with OpenGL for rendering graphics and Bullet Physics for physics simulations, to create a highly resembling 3D representation of the user in the virtual space. This 3D representation, henceforth referred to as the avatar, is highly sensitive to the user's movements.

Among alternative depth sensors, the DepthSense camera is able to provide 60 fps with depth ranging from 15 cm to 1 m, but its frame throughput diminishes quickly as the depth range increases, going from 15 fps at 2.5 m to 6 fps at 4 m (DepthSense).

Focusing on the three levels of self-representation that contribute to avatar identity is especially important in VR applications in order to improve the user's performance in the system. In the discussion section, we present how the developed system exploits those levels in order to create a sense of embodiment.

This connection between avatar and physical self has been investigated before. Wrzesien et al. (2015) focused on how teenagers can learn to regulate their emotions by watching physically similar avatars deal with frustration. The authors concluded that when the avatar is modeled to resemble the human user, the intensity of the emotional states, emotional valence, and arousal for each participant is much greater than when the avatar is neutral. This confirms that the similarity between the avatar and the user may cause a significant psychological response.

Another study (Fox and Bailenson, 2009) was concerned with how this similarity might modify behavior. After being divided into groups with a virtual self-representation or a representation of another person, participants were exposed to voluntary sessions of exercises with rewards and punishments. Given the benefits of an avatar representation, it is expected that such a simulation would benefit from the avatar generation method described here by improving the overall experience of the system.

As a last example, tourism and architecture could benefit from this technology by giving users an immersive experience of the depicted space. It is probably easier to create a sense of space and dimensions if you can refer to your own body as a reference. In this sense, interior design professionals and students could use our method to showcase a project and literally place the observer inside the space to visually comprehend each detail of the project. In tourism, virtual environments can replicate interesting places and allow users to see and even interact with elements through their own avatar.

Hardware and development environment preparation
Hardware integration from different manufacturers is a key point in VR applications and may increase software development complexity.

Table 2. Basic requirements for the development of the software.

R1    The software shall create a 3D recreation of the user, known as the avatar.
R1.1  The avatar shall be defined by real-time data from the depth and color sensors.
R1.2  The avatar shall be displayed in a 3D virtual environment in an HMD.
R2    The software shall offer the option to visualize in first-person or third-person perspective.
R3    The software shall render at an acceptable frame rate (minimum 24 fps).
R4    The software shall provide simulated physical interaction between the avatar and the environment.
R5    The software shall provide a keyboard interface to select which user has the Oculus Rift point of view, when first-person perspective is selected.
R6    The software shall provide a keyboard interface to adjust the camera position in order to accommodate and match the user's virtual visualization with the real world.
The requirements listed in Table 2 guided the development of the software.

A point cloud can be obtained from laser scanner data or photogrammetric image measurements (Remondino, 2003). For this purpose, this paper uses the Kinect V2, which provides features beyond depth sensing that will be discussed later.

The rationale is that it is possible to create a polygon surface from the point cloud of one user.

Next, the depth information from the depth frame is mapped to camera space in order to create the aforementioned point cloud, as shown in Fig. 2. This means that each pixel in the depth frame is mapped to a 3D virtual representation of the real world, so that the software is able to plot a visualization of the point cloud. Alongside, the depth frame also needs to be mapped to the color information from the color frame, so that the software is able to associate an RGB color with each point in the depth frame and render a colored avatar.

We also investigate the interactivity experience with respect to physical collisions of virtual objects with the avatar. We then proceed with three tasks that further evidence the properties of the system. In the first task, the block tracking task, participants are asked to follow the motion of a block that moves sinusoidally.

In the second task, we evaluate whether the system provides a realistic interaction experience, i.e., whether the actions performed by the participant are correctly captured by the Kinect sensor and displayed as a point cloud avatar that is able to interact with virtual 3D objects. To accomplish that, participants are asked to jiggle a green block from left to right for as long as they can, making sure that it does not fall.

Performance is assessed by the number of successful jiggles.

In the third and last task, we investigate whether our Oculus plus Kinect method performs similarly to other systems in terms of trajectory tracking and mismatches between the virtual and real world integration. In this task, the participant is positioned in front of a table, located between six pillars (Figure 3A). The Kinect is placed ≈1.5 m in front of the table, and six precision motion capture infrared cameras (OptiTrack s250e, NaturalPoint Inc.) are attached to the pillars (one to each pillar). On the table, we mark two lines (53 cm apart) and place a black block right to the side of the leftmost line.

Then, one participant is asked to position their hand next to the rightmost line and move toward reaching the black block (Figure 3B). The procedure is repeated for 120 seconds. Using reflective markers and the IR cameras, we are able to accurately measure the spatial coordinates of the participant's hand and of the block, as well as detect when the hand crosses either of the lines. During the task, the participant is wearing the Oculus Rift.

One of the keystones of an immersive virtual reality experience is interactivity. In this approach, we used the Bullet physics engine in order to recreate the desired physical interaction between the avatar and an arbitrary virtual object.

A green solid rectangular block was created in the virtual environment and placed within reach of the user's avatar. Figure 6 represents the moment when the user tries to balance the green block with their own hands, preventing it from falling. In the code, the programmer can add more objects to the simulation using the Bullet physics engine, as sketched below.
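As a minimal sketch of how such an object can be added with the Bullet API: the world setup below is standard Bullet boilerplate, and the block dimensions, mass, and placement are illustrative values rather than the ones used in the actual scene.

```cpp
// Sketch: a Bullet world with one dynamic box standing in for the green
// block. Shapes and bodies are leaked here for brevity; a real program
// would track and free them.
#include <btBulletDynamicsCommon.h>

btDiscreteDynamicsWorld* createWorld()
{
    auto* config     = new btDefaultCollisionConfiguration();
    auto* dispatcher = new btCollisionDispatcher(config);
    auto* broadphase = new btDbvtBroadphase();
    auto* solver     = new btSequentialImpulseConstraintSolver();
    auto* world = new btDiscreteDynamicsWorld(dispatcher, broadphase, solver, config);
    world->setGravity(btVector3(0, -9.81f, 0));
    return world;
}

btRigidBody* addBlock(btDiscreteDynamicsWorld* world)
{
    // Half-extents of the block in meters (illustrative values).
    auto* shape = new btBoxShape(btVector3(0.15f, 0.05f, 0.05f));
    btScalar mass = 0.5f;
    btVector3 inertia(0, 0, 0);
    shape->calculateLocalInertia(mass, inertia);

    auto* motion = new btDefaultMotionState(
        btTransform(btQuaternion::getIdentity(), btVector3(0, 1.2f, 0.8f)));
    auto* body = new btRigidBody(
        btRigidBody::btRigidBodyConstructionInfo(mass, motion, shape, inertia));
    world->addRigidBody(body);
    return body;
}
```

Once created, the world is advanced once per rendered frame with world->stepSimulation(dt), so the block falls and reacts to contacts between frames.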

In our approach, each joint vertex provided by the Kinect becomes a Bullet static object. A general distribution of joints along the body can be seen in Fig. 4, where each red dot is a joint.

Therefore, each red dot can collide with virtual elements, while the dot itself does not suffer any consequence from the collision, because its location depends only on the data from the Kinect. In Fig. 6, the red dots are the contact points between the avatar's hands and the green block and are responsible for pushing and moving the object under any condition.
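One way to realize these per-joint contact objects, sketched below, is to mirror each Kinect joint with a small kinematic sphere in Bullet: a kinematic body pushes dynamic objects but is itself driven purely by external data, matching the behavior described above. The sphere radius and function names are our illustrative choices, not necessarily those of the original implementation.

```cpp
// Sketch: each tracked Kinect joint is mirrored by a kinematic sphere
// whose transform is overwritten every frame from the skeleton data.
#include <btBulletDynamicsCommon.h>
#include <Kinect.h>

btRigidBody* makeJointBody(btDiscreteDynamicsWorld* world)
{
    auto* shape  = new btSphereShape(0.05f);              // 5 cm contact sphere
    auto* motion = new btDefaultMotionState();
    // Mass 0 + kinematic flag: Bullet treats the body as externally driven.
    auto* body = new btRigidBody(
        btRigidBody::btRigidBodyConstructionInfo(0.0f, motion, shape));
    body->setCollisionFlags(body->getCollisionFlags() |
                            btCollisionObject::CF_KINEMATIC_OBJECT);
    body->setActivationState(DISABLE_DEACTIVATION);
    world->addRigidBody(body);
    return body;
}

// Called every frame with the latest tracked skeleton.
void updateJointBodies(btRigidBody* bodies[JointType_Count],
                       const Joint joints[JointType_Count])
{
    for (int j = 0; j < JointType_Count; ++j) {
        if (joints[j].TrackingState == TrackingState_NotTracked) continue;
        const CameraSpacePoint& p = joints[j].Position;   // meters, camera space
        btTransform t(btQuaternion::getIdentity(), btVector3(p.X, p.Y, p.Z));
        bodies[j]->getMotionState()->setWorldTransform(t); // teleport, no dynamics
    }
}
```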

Avatar interaction with the VR world was assessed with three tasks. In the first task, the block tracking task, both naive and experienced users succeeded in following the sinusoidal block motion (Figures 7A and 7C). Note the high resemblance between block and user trajectories in the X direction (R² = 0.99), whilst the Y and Z directions present a more divergent motion.

We proceeded by investigating the user's ability in handling a virtual object. Figures 7B and 7D show the results for the block jiggling task. Both naive and experienced users were able to manipulate the object, but the latter were significantly better (95% confidence, as indicated by the non-overlapping boxplot notches) at preventing the block from falling.

The trajectories recorded in the third task are shown in Figure 8A. Note that the OptiTrack trajectories are smoother than the Oculus plus Kinect trajectories, which is expected given the OptiTrack tracking method (infrared) and superior frame rate (120 frames/s).

The moments at which the two systems detect that the participant has reached the block differ by ≈30 ms.

An approach for creating a real-time responsive and user-resembling avatar
This work presents a structured method for integrating both hardware and software elements in order to conceive an avatar visualization that is responsive to user movements and can interact with virtual objects, creating avatar identity. The last level of self-representation, affection, is not directly addressed by the system, as we do not shape or modify the avatar to increase affection. Yet it emerges from the affection the user has for their own virtual representation, since it closely resembles their real body shape.

The point cloud itself imposes another problem that affects the speed of the system. During the rendering cycle, after the data is acquired from the Kinect, the system has to convert all the data points from the depth space to the camera space (i.e., the real-world 3D dimensions). This processing is done on the CPU because it uses Kinect-specific functions, and this limits the processing throughput according to the number of cores the CPU has.
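For reference, the sketch below shows this conversion step using the Kinect for Windows SDK 2.0 coordinate mapper; the same mapper also associates each depth point with a color-frame pixel, as described earlier. Frame acquisition, error handling, and the render loop are omitted, and the function name is ours.

```cpp
// Minimal sketch of the depth-to-camera and depth-to-color mapping step.
// `sensor` is an already-initialized IKinectSensor.
#include <Kinect.h>
#include <vector>

static const int kDepthWidth  = 512;   // Kinect V2 depth frame dimensions
static const int kDepthHeight = 424;

void MapDepthFrame(IKinectSensor* sensor, const UINT16* depthBuffer)
{
    ICoordinateMapper* mapper = nullptr;
    sensor->get_CoordinateMapper(&mapper);

    const UINT numPoints = kDepthWidth * kDepthHeight;
    std::vector<CameraSpacePoint> cameraPoints(numPoints); // 3D points (meters)
    std::vector<ColorSpacePoint>  colorPoints(numPoints);  // pixels in color frame

    // Each depth pixel becomes a 3D point in camera space (the point cloud)...
    mapper->MapDepthFrameToCameraSpace(numPoints, depthBuffer,
                                       numPoints, cameraPoints.data());
    // ...and is also associated with a pixel in the 1920x1080 color frame,
    // so an RGB color can be attached to every point before rendering.
    mapper->MapDepthFrameToColorSpace(numPoints, depthBuffer,
                                      numPoints, colorPoints.data());

    // cameraPoints can now be uploaded to an OpenGL vertex buffer; invalid
    // depth readings map to -infinity and should be filtered out first.
    mapper->Release();
}
```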

In addition, there are several methods to create a surface from the point cloud. The method used to create Figure 5 consists of connecting the camera space points in groups of three, considering the 2D matrix from which they were acquired, and rendering a triangle with OpenGL; a sketch of this triangulation is given below. However, using the polygon rendering mode with OpenMP may cause the program to suffer a severe drop in rendered frames per second, so further work should explore this issue. When creating a smooth surface from the point cloud, it is important to observe how the method will affect frame throughput in the OpenGL engine. An optimal implementation would seek to parallelize this process within the graphics processing unit (GPU).
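The sketch below illustrates this grid triangulation under our stated assumptions: it only builds the triangle index list from the camera-space points, leaving OpenGL buffer management aside, and the function name and validity test are ours.

```cpp
// Sketch: neighboring points in the 512x424 depth matrix are joined into
// two triangles per grid cell. Points with invalid depth (mapped to -inf
// by the SDK) break the surface at that location.
#include <cmath>
#include <vector>
#include <Kinect.h>

std::vector<unsigned int> buildTriangleIndices(
    const std::vector<CameraSpacePoint>& pts, int width, int height)
{
    auto valid = [&](int i) { return std::isfinite(pts[i].Z); };
    std::vector<unsigned int> indices;
    indices.reserve(static_cast<size_t>(width) * height * 6);

    for (int y = 0; y < height - 1; ++y) {
        for (int x = 0; x < width - 1; ++x) {
            int i0 = y * width + x, i1 = i0 + 1;      // top-left, top-right
            int i2 = i0 + width,    i3 = i2 + 1;      // bottom-left, bottom-right
            if (valid(i0) && valid(i2) && valid(i1))
                indices.insert(indices.end(),
                               {unsigned(i0), unsigned(i2), unsigned(i1)});
            if (valid(i1) && valid(i2) && valid(i3))
                indices.insert(indices.end(),
                               {unsigned(i1), unsigned(i2), unsigned(i3)});
        }
    }
    return indices; // feed to glDrawElements(GL_TRIANGLES, ...)
}
```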

The implementation of a point cloud triangulation algorithm such as the one presented in Scheidegger et al. (2005) would have to be tested for real-time applications.

Integration with eye and gaze tracking technologies
One of the key factors for immersion, as mentioned by Alshaer et al. (2017), is the field of view (FOV).

In general, FOV or visual field is the area of space within which an individual can detect the presence of a visual stimulus (Dagnelie, 2011). For VR, one might assume that matching the FOV of VR headsets with that of our own eyes would be ideal, but that is not necessarily the case.

In order to use the present system in a mobile setting, we have to consider how the system is organized and how the interface with each device works. If the Oculus Rift were substituted for a smartphone-based HMD, there would still be the need to use the Kinect to create the avatar representation. However, building an interface to connect the Kinect directly with a mobile platform is challenging. A possible solution to this problem is to implement a web service whose main task would be to pipe pre-processed Kinect frame data to the smartphone. In this architecture, the bulk of the processing, which is converting depth points to 3D space, would be done on the desktop computer, and a portable graphics rendering solution such as OpenGL ES would be able to use this data to create the virtual environment similarly to what happens with the Oculus Rift. A similar project was developed by Peek (2017), in which depth and skeleton data are sent to a Windows Phone.
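As a rough illustration of this idea, the sketch below packs downsampled camera-space points into a flat buffer that such a web service could stream to the phone. The wire format, magic number, and downsampling factor are hypothetical choices for illustration, not part of the present system.

```cpp
// Hypothetical wire format: the desktop packs downsampled camera-space
// points into a flat buffer that a mobile client can unpack straight into
// an OpenGL ES vertex buffer.
#include <cstdint>
#include <cstring>
#include <vector>
#include <Kinect.h>

struct FrameHeader {
    uint32_t magic;      // e.g. 'KPC1', lets the client validate the stream
    uint32_t pointCount; // number of xyz triplets that follow
};

std::vector<uint8_t> packFrame(const std::vector<CameraSpacePoint>& pts, int stride)
{
    std::vector<float> xyz;
    for (size_t i = 0; i < pts.size(); i += stride) { // downsample to save bandwidth
        xyz.push_back(pts[i].X);
        xyz.push_back(pts[i].Y);
        xyz.push_back(pts[i].Z);
    }
    FrameHeader hdr{0x4B504331u, static_cast<uint32_t>(xyz.size() / 3)};
    std::vector<uint8_t> out(sizeof(hdr) + xyz.size() * sizeof(float));
    std::memcpy(out.data(), &hdr, sizeof(hdr));
    std::memcpy(out.data() + sizeof(hdr), xyz.data(), xyz.size() * sizeof(float));
    return out; // hand to any transport (HTTP chunk, WebSocket, raw TCP)
}
```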

The method presented in this work lies within the realm of VR applications, especially those which require avatar representation and physical interactions with the virtual environment. The hardware and software approach we used is expandable to other systems. It is innovative in the way it binds the hardware and software capabilities, by integrating the virtual environment with skeleton tracking, body detection, and object collisions, in order to create a unique, user-resembling and interactive avatar. All of these features together appeal to the final user and contribute to an immersive interaction with virtual reality.

ACKNOWLEDGMENTS
We acknowledge the support of the staff and students of the Edmond e Lily Safra International Institute of Neuroscience.