VOODOO VR: One-Shot Neural Avatars for Virtual Reality

Phong Tran¹, Egor Zakharov², Long Nhat Ho¹, Adilbek Karmanov¹, Liwen Hu⁴, Maksat Kengeskanov¹, McLean Goldwhite⁴, Aviral Agarwal⁴, Ariana Bermudez Venegas¹, Anh Tuan Tran³, Otmar Hilliges², Hao Li¹,⁴

¹MBZUAI, ²ETH Zurich, ³VinAI Research, ⁴Pinscreen

Abstract Code (coming soon) Youtube

Introducing VOODOO VR: a two-way immersive telepresence solution based on state-of-the-art one-shot facial reenactment technology.

Instant Avatar Creation

One-shot Self-reenactment with VR Headset

Head Reenactment with Pose Disentanglement

One-shot Cross-reenactment with VR Headset

Immersive Two-way VR Telepresence

VR Telepresence System

We integrate our facial reenactment method into a VR telepresence system. The system builds an avatar instantly from a single regular RGB webcam. The captured avatar is then driven by the head pose and expressions tracked by the VR headset.

Instant Avatar Creation. We use a single regular RGB webcam to capture a portrait image of the user. The user gets 3 seconds to prepare, and creating the avatar then takes only 60 milliseconds. Users can also upload any portrait image instead of capturing themselves, which lets them customize the avatar or appear as a different person.

Driving the avatar. After creating the avatar, users control it with a VR headset, which tracks the 6-DoF head pose using its built-in inside-out/IMU head pose tracker. For expressions, the headset tracks 63 different facial muscle movements plus eye gaze, corresponding to 65 blendshapes (63 for expression and 2 for eye gaze). We use these blendshapes to create a generic CG face, which is then streamed into our reenactment model as an expression driver.
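The driving signal described above can be pictured as a small data structure. The following is only a minimal sketch: the names `DrivingSignal` and `make_driving_signal` are hypothetical, and both the pose layout (three translation and three rotation values) and the ordering of expression before gaze in the blendshape vector are assumptions, not details given by the system description.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

NUM_EXPRESSION = 63  # facial muscle movements tracked by the headset
NUM_GAZE = 2         # eye-gaze blendshapes
NUM_BLENDSHAPES = NUM_EXPRESSION + NUM_GAZE  # 65 in total
NUM_POSE = 6         # 6-DoF head pose


@dataclass(frozen=True)
class DrivingSignal:
    """One tracked frame (hypothetical container, for illustration only)."""
    head_pose: Tuple[float, ...]   # assumed layout: (x, y, z, rx, ry, rz)
    expression: Tuple[float, ...]  # 63 expression blendshape weights
    gaze: Tuple[float, ...]        # 2 eye-gaze blendshape weights


def make_driving_signal(head_pose: Sequence[float],
                        blendshapes: Sequence[float]) -> DrivingSignal:
    """Validate sizes and split the blendshape vector into expression and gaze."""
    if len(head_pose) != NUM_POSE:
        raise ValueError(f"expected {NUM_POSE} pose values, got {len(head_pose)}")
    if len(blendshapes) != NUM_BLENDSHAPES:
        raise ValueError(
            f"expected {NUM_BLENDSHAPES} blendshapes, got {len(blendshapes)}")
    return DrivingSignal(
        head_pose=tuple(head_pose),
        # Assumption: expression weights come first, gaze weights last.
        expression=tuple(blendshapes[:NUM_EXPRESSION]),
        gaze=tuple(blendshapes[NUM_EXPRESSION:]),
    )
```

Checking the sizes up front means a tracker frame with the wrong number of values is rejected before it reaches the renderer of the generic CG face.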

Two-way communication system. Our system can be used for VR telepresence applications, allowing two users to communicate with each other in virtual reality. The communication system consists of two machines connected via TCP/IP. The avatar image, head pose, and expressions of the first user are streamed to the machine of the second user, and vice versa. Since the avatar image is sent only once, only the pose and expression need to be streamed continuously, a total of 65 floating-point numbers. Latency is therefore very low, allowing seamless communication.
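To illustrate why the per-frame payload is so small, here is a minimal sketch of how 65 floats could be serialized and read back over a TCP connection. The wire format is an assumption: little-endian 32-bit floats with no header, giving 65 × 4 = 260 bytes per frame. The helper names `encode_frame`, `decode_frame`, and `recv_exact` are hypothetical and do not describe the actual protocol used by the system.

```python
import socket
import struct

NUM_FLOATS = 65                              # per-frame driving signal size from the text
FRAME_FORMAT = f"<{NUM_FLOATS}f"             # assumption: little-endian float32
FRAME_SIZE = struct.calcsize(FRAME_FORMAT)   # 260 bytes under that assumption


def encode_frame(values):
    """Pack one frame of 65 floats into a fixed-size byte string."""
    if len(values) != NUM_FLOATS:
        raise ValueError(f"expected {NUM_FLOATS} floats, got {len(values)}")
    return struct.pack(FRAME_FORMAT, *values)


def decode_frame(payload):
    """Unpack a fixed-size byte string back into a list of 65 floats."""
    if len(payload) != FRAME_SIZE:
        raise ValueError(f"expected {FRAME_SIZE} bytes, got {len(payload)}")
    return list(struct.unpack(FRAME_FORMAT, payload))


def recv_exact(sock, num_bytes):
    """Read exactly num_bytes from a socket.

    TCP is a byte stream, so a single recv() may return only part of a frame.
    """
    chunks = []
    remaining = num_bytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before a full frame arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

On the sending side, each tracked frame would be written with `sock.sendall(encode_frame(values))`; the receiver calls `decode_frame(recv_exact(sock, FRAME_SIZE))` in a loop. Fixed-size frames avoid the need for a length prefix or delimiter.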