Introducing VOODOO VR: a two-way immersive telepresence solution based on state-of-the-art one-shot facial reenactment technology.

Abstract

We present a complete solution for real-time immersive face-to-face communication using VR headsets and photorealistic neural head avatars generated instantly from a single photo. Our avatars are view-consistent neural radiance fields of a person's head, lifted into 3D from 2D photographs as a disentangled appearance-expression representation via transformer networks.

Instant Avatar Creation

One-shot Self Reenactment with VR Headset

Head Reenactment with Pose Disentanglement

One-shot Cross-reenactment with VR Headset

Immersive Two-way VR Telepresence

VR Telepresence System

Teaser image

We integrate our facial reenactment method into a VR telepresence system. Our system can build the avatar instantly using a single regular RGB webcam. Then the captured avatar can be driven using the head pose and expression tracked by the VR headset.


Instant Avatar Creation. We use a single regular RGB webcam to capture a portrait image of the user. The whole process takes 3 seconds for the user to prepare and only 60 milliseconds to create the avatar. Users can also upload any portrait image into the system instead of capturing themselves. This unlocks the ability to customize the avatar or become a different person.

Driving the avatar. After creating the avatar, users can control it using a VR headset, which tracks the 6-DoF head pose with its built-in inside-out/IMU head pose tracker. For expressions, the headset can track 63 different facial muscle movements plus eye gaze, corresponding to 65 blendshapes (63 for expression and 2 for eye gaze). We use these blendshapes to create a generic CG face, which is then streamed into our reenactment model as an expression driver.
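To make the driving signal concrete, here is a minimal Python sketch of a per-frame container holding the quantities described above. The class name `DrivingSignal`, its field names, and the pose layout (three translations followed by three rotations) are illustrative assumptions, not the system's actual data structures; only the counts (6 DoF, 63 expression blendshapes, 2 gaze blendshapes) come from the text.

```python
from dataclasses import dataclass
from typing import List

NUM_POSE_DOF = 6      # head pose degrees of freedom (from the text)
NUM_EXPRESSION = 63   # expression blendshapes (from the text)
NUM_GAZE = 2          # eye-gaze blendshapes (from the text)
NUM_BLENDSHAPES = NUM_EXPRESSION + NUM_GAZE  # 65 in total


@dataclass
class DrivingSignal:
    """Hypothetical per-frame driving signal; names and layout are assumptions."""
    head_pose: List[float]    # assumed order: tx, ty, tz, rx, ry, rz
    expression: List[float]   # 63 blendshape weights
    gaze: List[float]         # 2 eye-gaze blendshape weights

    def __post_init__(self) -> None:
        # Reject frames whose sizes do not match the counts given in the text.
        if len(self.head_pose) != NUM_POSE_DOF:
            raise ValueError("head_pose must have 6 values")
        if len(self.expression) != NUM_EXPRESSION:
            raise ValueError("expression must have 63 values")
        if len(self.gaze) != NUM_GAZE:
            raise ValueError("gaze must have 2 values")

    def blendshapes(self) -> List[float]:
        """Return the 65 blendshape weights, expression first, then gaze."""
        return list(self.expression) + list(self.gaze)


# Example: a neutral face looking straight ahead with an identity head pose.
neutral = DrivingSignal(
    head_pose=[0.0] * NUM_POSE_DOF,
    expression=[0.0] * NUM_EXPRESSION,
    gaze=[0.0] * NUM_GAZE,
)
```

Validating the sizes at construction time keeps malformed tracker frames from reaching later stages, which is a design choice of this sketch rather than something the text specifies.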

Two-way communication system. Our system can be used for VR telepresence applications, allowing two users to communicate with each other in virtual reality. The communication system consists of two machines connected via TCP/IP. The avatar image, head pose, and expressions of the first user are streamed to the machine of the second user and vice versa. Since the avatar image is sent only once, we only need to stream the pose and expression afterwards, a total of 65 float numbers per update. The latency is therefore very low, allowing seamless communication.
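The small per-update payload can be illustrated with a minimal Python sketch that serializes 65 floats and sends them over a TCP socket. The wire format here (little-endian 32-bit floats in a fixed-size frame, with no header) and the helper names `pack_frame`, `send_frame`, and `recv_frame` are assumptions for illustration; the text only states that 65 float numbers are streamed over TCP/IP.

```python
import socket
import struct

NUM_FLOATS = 65  # floats streamed per update (from the text)
# Assumed encoding: little-endian float32, giving 65 * 4 = 260 bytes per frame.
FRAME = struct.Struct(f"<{NUM_FLOATS}f")


def pack_frame(values):
    """Serialize exactly NUM_FLOATS values into a fixed-size binary frame."""
    if len(values) != NUM_FLOATS:
        raise ValueError(f"expected {NUM_FLOATS} values, got {len(values)}")
    return FRAME.pack(*values)


def unpack_frame(payload):
    """Decode one fixed-size binary frame back into a list of floats."""
    return list(FRAME.unpack(payload))


def recv_exact(sock, n):
    """Read exactly n bytes; TCP is a byte stream, so recv may return less."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_frame(sock, values):
    """Send one update over an already connected socket."""
    sock.sendall(pack_frame(values))


def recv_frame(sock):
    """Receive one update from an already connected socket."""
    return unpack_frame(recv_exact(sock, FRAME.size))
```

Under these assumptions each update is 260 bytes, so even at an assumed rate of 30 updates per second the stream is under 8 KB/s per direction, compared with the one-time cost of sending the avatar image. Because every frame has the same size, the receiver needs no length prefix and can simply read 260 bytes at a time.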

Accompanying Video

BibTeX

@article{tran2023voodoo,
  title={VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment},
  author={Tran, Phong and Zakharov, Egor and Ho, Long-Nhat and Tran, Anh Tuan and Hu, Liwen and Li, Hao},
  journal={arXiv preprint arXiv:2312.04651},
  year={2023}
}