Generating faithful visualizations of human faces requires capturing both coarse and fine-level details of the face geometry and appearance. Existing methods are either data-driven, requiring an extensive corpus of data not publicly accessible to the research community, or fail to capture fine details because they rely on geometric face models whose mesh discretization and linear deformation are designed to model only coarse face geometry and therefore cannot represent fine-grained texture details. We introduce a method that bridges this gap by drawing inspiration from traditional computer graphics techniques. Unseen expressions are modeled by blending appearance from a sparse set of extreme poses. This blending is performed by measuring local volumetric changes in those expressions and locally reproducing their appearance whenever a similar expression is performed at test time. We show that our method generalizes to unseen expressions, adding fine-grained effects on top of smooth volumetric deformations of a face, and demonstrate how it generalizes beyond faces.
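To make the blending idea more concrete, the following is a minimal sketch of how local volumetric change could be turned into per-region blending weights. It is an illustration under stated assumptions, not the formulation used in the paper: the tetrahedral mesh layout, the relative-volume measure, the softmax over negative absolute differences, and the names `tet_volumes`, `blend_weights`, and `temperature` are all hypothetical choices made for this example.

```python
import numpy as np


def tet_volumes(vertices, tets):
    """Signed volume of each tetrahedron.

    vertices: (V, 3) float array, tets: (T, 4) integer index array.
    """
    a, b, c, d = (vertices[tets[:, i]] for i in range(4))
    # Scalar triple product divided by 6 gives the signed volume.
    return np.einsum("ij,ij->i", np.cross(b - a, c - a), d - a) / 6.0


def blend_weights(rest, extremes, test, tets, temperature=0.05):
    """Per-tetrahedron weights over K extreme expressions (assumed scheme).

    rest: (V, 3) neutral mesh, extremes: (K, V, 3), test: (V, 3).
    Returns a (K, T) array whose columns sum to 1.
    """
    v_rest = tet_volumes(rest, tets)
    # Relative volume change of every tetrahedron with respect to the rest pose.
    change_test = tet_volumes(test, tets) / v_rest - 1.0
    change_ext = np.stack(
        [tet_volumes(e, tets) / v_rest - 1.0 for e in extremes]
    )
    # Assumption: an extreme expression contributes more where its local
    # volume change is closer to the one observed at test time.
    logits = -np.abs(change_ext - change_test[None, :]) / temperature
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)


# Toy usage: one tetrahedron, two extremes (stretched and squashed along z).
rest = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tets = np.array([[0, 1, 2, 3]])
extremes = np.stack([rest * [1.0, 1.0, 2.0], rest * [1.0, 1.0, 0.5]])
weights = blend_weights(rest, extremes, extremes[0], tets)
```

In this toy case the test mesh equals the first extreme, so nearly all of the weight falls on that expression. In a full pipeline, such weights would be used to mix the appearance associated with each extreme expression locally; how that appearance is represented is outside the scope of this sketch.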
We present video sequences generated by models trained on the datasets used in the paper. We directly compare renderings from our method with two baselines: VolTeMorph [1] trained on all frames in the data, and VolTeMorph trained on the single most extreme expression. We start by showing the extrapolation capabilities of all methods, obtained by directly modifying the expression vector $\mathbf{e}$. We then show renderings driven by external expression data from the Multiface dataset [2]. We end this page by showing how these approaches perform on the synthetic datasets. We additionally show results from baselines that are conditioned on the expression vector. Our method generates the most realistic images while retaining controllability of the face.
Because our method builds on Neural Radiance Fields, it can render the face along arbitrary camera trajectories, which may find applications in future communication devices.