Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis

Abstract

NeRFs have enabled highly realistic synthesis of human faces, including complex appearance and reflectance effects of hair and skin. These methods typically require a large number of multi-view input images, making the process hardware-intensive and cumbersome and limiting their applicability to unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-resolution novel views of subjects that are not part of the prior's training distribution. This prior model consists of an identity-conditioned NeRF trained on a dataset of low-resolution multi-view images of diverse humans with known camera calibration. A simple sparse landmark-based 3D alignment of the training dataset allows our model to learn a smooth latent space of geometry and appearance despite a limited number of training identities. A high-quality volumetric representation of a novel subject can be obtained by model fitting to 2 or 3 camera views of arbitrary resolution. Importantly, at inference time our method requires as few as two casually captured views as input.
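
The landmark-based alignment mentioned in the abstract can be made concrete with a standard similarity fit (Umeyama's method): each subject's sparse 3D landmarks are registered to a shared canonical template so that all identities live in one coordinate frame. The following is a minimal sketch, not the paper's exact pipeline; the landmark arrays and the canonical template are illustrative assumptions.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D landmarks.
    """
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)            # cross-covariance matrix
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Hypothetical example: register a subject's landmarks (and, with the same
# transform, its cameras) to a canonical template.
template_lms = np.random.randn(68, 3)           # placeholder canonical landmarks
subject_lms = 0.5 * template_lms + 1.0          # placeholder subject landmarks
s, R, t = umeyama_similarity(subject_lms, template_lms)
aligned_lms = s * (subject_lms @ R.T) + t       # now in the canonical frame
```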

Ultra High-resolution Synthesis from Studio Captures

Given 3 views of a held-out test subject from our dataset, we show high-quality novel view synthesis. Note the 3D-consistent rendering of details such as hair strands and eyelashes, and the view-dependent effects, for example on the forehead.

Please note that due to the file size limit (100 MB), we can only show a limited set of high-resolution video results.

Input Views (4Kx6K)

Novel View Synthesis (4Kx4K - cropped to center)

Novel View Synthesis (2Kx2K)

Input Views (4Kx6K)

Novel View Synthesis (4Kx4K - cropped to center)

Novel View Synthesis (2Kx2K)

High-resolution Synthesis on FaceScape Dataset

We show novel view synthesis results at 2K resolution using only two input views of subjects from the FaceScape dataset. Note that our prior model was trained on a different dataset; these results therefore represent an out-of-distribution setting.

Input Views

Novel View Synthesis

Input Views

Novel View Synthesis

In-the-Wild Captures

We demonstrate the generalization capability of our method on in-the-wild mobile camera captures. With just 2 input views, our method generates highly consistent and photorealistic free-viewpoint renders of a subject. It not only reconstructs coherent geometry but also learns to interpolate view-dependent specularities, such as on the hair and skin.

Input Views

Novel View Synthesis

Input Views

Novel View Synthesis

One-shot Synthesis

Synthesis from a single view naturally suffers from the bas-relief ambiguity, making it much more challenging, but our prior model still enables plausible results. We show results on our studio captures and one in-the-wild example.

Input View

Novel View Synthesis

Input View

Novel View Synthesis

Latent Space Interpolation

Alignment of faces in our dataset allows us to learn a continuous latent space, where the embeddings of training identities can be interpolated to achieve plausible intermediate identities. Note that we do not train our model in an adversarial manner but only with reconstruction losses.
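
A minimal sketch of how such an interpolation could be rendered; `render_fn` stands in for the identity-conditioned NeRF renderer and `z_a`, `z_b` for two learned identity embeddings, all hypothetical names:

```python
import torch

@torch.no_grad()
def interpolate_identities(z_a, z_b, render_fn, camera, num_steps=30):
    """Render a sweep of intermediate identities between two embeddings."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, num_steps):
        z = (1.0 - alpha) * z_a + alpha * z_b   # convex blend in latent space
        frames.append(render_fn(z, camera))     # decode like a training identity
    return frames
```

Because the latent space is smooth, a simple linear blend is enough to produce plausible intermediate identities.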

Identity A

A -> B

Identity B

B -> C

Identity C

Ablation - Regularization

We show the effect of using regularization during model fitting. Note the colour distortions in the absence of view regularization and the fuzzy surfaces in the absence of normal regularization.
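
For illustration, here is one plausible form of the fitting objective with both regularizers enabled; the exact terms and weights used in the paper may differ, and all inputs are assumed to be precomputed per ray batch:

```python
import torch

def fitting_loss(rgb_pred, rgb_gt, rgb_viewdep, rgb_diffuse,
                 normals, normals_jittered, w_view=0.01, w_normal=0.001):
    """Illustrative fitting objective (not the paper's exact formulation)."""
    # photometric reconstruction against the 2-3 input views
    loss_photo = (rgb_pred - rgb_gt).abs().mean()
    # view regularization: keep view-dependent colour close to a
    # view-independent (diffuse) estimate, discouraging colour distortions
    # when extrapolating to unseen directions
    loss_view = (rgb_viewdep - rgb_diffuse).pow(2).mean()
    # normal regularization: penalize differences between normals at a point
    # and at a slightly jittered point, discouraging fuzzy surfaces
    loss_normal = (normals - normals_jittered).pow(2).mean()
    return loss_photo + w_view * loss_view + w_normal * loss_normal
```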

Input Views

Without Regularization

View Regularization Only

Full Regularization

Ablation - Initialization

Model fine-tuning has to be initialised correctly to avoid artifacts such as floaters or duplicated body parts, e.g., ears.
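
A sketch of the four initialization strategies compared below; `render_fn`, `train_latents`, and `input_views` are hypothetical stand-ins. "Inversion" freezes the prior and optimizes only the identity embedding against the input views before any network weights are fine-tuned:

```python
import torch

def init_latent(strategy, train_latents, input_views, render_fn, steps=200):
    """Return an identity embedding to start model fine-tuning from."""
    if strategy == "mean":
        return train_latents.mean(0)
    if strategy == "zeros":
        return torch.zeros(train_latents.shape[1])
    if strategy == "noise":
        return torch.randn(train_latents.shape[1])
    # "inversion" (ours): optimize z with the prior weights kept frozen
    z = train_latents.mean(0).clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum((render_fn(z, cam) - img).abs().mean()
                   for img, cam in input_views)
        loss.backward()
        opt.step()
    return z.detach()
```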

Input Views

Mean

Result

Initialization

Zeros

Result

Initialization

Noise

Result

Initialization

Inversion (Ours)

Result

Initialization

Ablation - Number of Views

Our method allows novel view synthesis from extremely sparse views, even from a single image. The examples below show how rendering quality improves as more views become available.

1 View

Result

Initialization

2 Views

Result

Initialization

3 Views

Result

Initialization

5 Views

Result

Initialization

7 Views

Result

Initialization

Ablation - Prior Model

We train the prior model on fewer identities and at lower resolution. The results show that a more diverse prior model performs better, while a higher-resolution prior model is not necessarily required.

Input Views

15 Identities (512x768)

Result

Initialization

350 Identities (512x768)

Result

Initialization

1450 Identities (256x384)

Result

Initialization

1450 Identities (512x768)

Result

Initialization

Geometry

We visualise the image-space geometry estimated by our method. Note the 3D-consistent depth and normals. The normals in the hair appear grey due to the semi-transparent density.
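
These maps can be derived from standard volume-rendering quantities; a minimal sketch with illustrative variable names is below. Where the accumulated weights sum to less than one, as on semi-transparent hair strands, the composited normal shrinks toward zero and renders grey, which explains the appearance noted above:

```python
import torch
import torch.nn.functional as F

def depth_and_normals(sigmas, dists, ts, grad_sigma):
    """Derive image-space geometry from NeRF samples along each ray.

    sigmas: (R, S) densities; dists: (R, S) sample spacings;
    ts: (R, S) sample depths; grad_sigma: (R, S, 3) density gradients.
    """
    alpha = 1.0 - torch.exp(-sigmas * dists)              # per-sample opacity
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], -1), -1)[:, :-1]
    weights = alpha * trans                               # volume-rendering weights
    depth = (weights * ts).sum(-1)                        # expected ray termination
    normals = -F.normalize(grad_sigma, dim=-1)            # normals from density gradient
    normal_map = (weights[..., None] * normals).sum(-2)   # alpha-composited normals
    matte = weights.sum(-1)                               # foreground alpha matte
    return depth, normal_map, matte
```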

Input Views

Colour

Depth

Normals

Foreground Matte

Comparison with Related Works

Input Views

RegNeRF

EG3D-based prior

KeypointNeRF*

Ours

*We made a considerable effort to train KeypointNeRF at 1K resolution, but we found that their results at 256x256 resolution are of much higher quality than their results at 1K. Therefore, the video presents their results at 256x256 resolution.

Limitations

While our method achieves state-of-the-art results in high-resolution synthesis of faces, it struggles with strong expressions and large accessories. This is due to limited coverage in our training dataset: it contains only neutral faces, and none of the subjects wore voluminous clothing such as jackets. This limitation could potentially be mitigated by training a more diverse prior model that includes these modalities.

Smile

Input Views

Novel View Synthesis

Grin

Input Views

Novel View Synthesis

Heavy Clothing

Input Views

Novel View Synthesis

Cap and Eyeglasses

Input Views

Novel View Synthesis