iVS-Net: Learning Human View Synthesis from Internet Videos

Junting Dong, Qi Fang, Tianshuo Yang, Qing Shuai, Chengyu Qiao, Sida Peng; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 22942-22951

Abstract


Recent advances in implicit neural representations make it possible to generate free-viewpoint videos of humans from sparse-view images. To avoid expensive per-person training, previous methods adopt generalizable human models and demonstrate impressive results. However, these methods usually rely on limited multi-view images, typically collected in a studio, or on commercial high-quality 3D scans for training, which severely limits their generalization to in-the-wild images. To solve this problem, we propose a new approach that learns a generalizable human model from a new source of data, i.e., Internet videos. These videos capture a wide variety of human appearances and poses and record the performers from abundant viewpoints. To exploit these videos, we present a temporal self-supervised pipeline that enforces the local appearance consistency of each body part across different frames of the same video. Once learned, the human model enables the creation of photorealistic free-viewpoint videos from a single input image. Experiments show that our method produces high-quality view synthesis on in-the-wild images while training only on monocular videos.
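To make the temporal self-supervision concrete, the sketch below illustrates one plausible form of the local appearance-consistency objective described above: colors predicted for the same body-part locations should agree when rendered in two different frames of the same monocular video. The paper does not specify its exact loss or sampling strategy here, so all names and shapes (e.g., `part_consistency_loss`, `part_visibility`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a per-part temporal appearance-consistency loss,
# assuming colors have already been rendered at matching canonical body-part
# locations in two frames of the same Internet video.
import torch
import torch.nn.functional as F


def part_consistency_loss(colors_frame_a: torch.Tensor,
                          colors_frame_b: torch.Tensor,
                          part_visibility: torch.Tensor) -> torch.Tensor:
    """Penalize appearance differences of corresponding body-part samples.

    colors_frame_a / colors_frame_b: (N, 3) RGB values predicted for the same
        N canonical body-part locations, rendered in two different frames.
    part_visibility: (N,) soft weights in [0, 1] marking samples that are
        visible (and thus comparable) in both frames.
    """
    # Per-sample L1 color difference, averaged over RGB channels.
    per_sample = F.l1_loss(colors_frame_a, colors_frame_b,
                           reduction="none").mean(dim=-1)
    # Down-weight samples occluded in either frame.
    weighted = per_sample * part_visibility
    return weighted.sum() / part_visibility.sum().clamp(min=1e-6)


if __name__ == "__main__":
    # Toy usage: random stand-ins for rendered part colors and visibility.
    a = torch.rand(1024, 3)
    b = torch.rand(1024, 3)
    vis = torch.rand(1024)
    print(part_consistency_loss(a, b, vis).item())
```

In this reading, the loss requires no multi-view supervision: the two frames come from a single monocular video, so consistency across time stands in for consistency across calibrated camera views.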

Related Material


[bibtex]
@InProceedings{Dong_2023_ICCV,
    author    = {Dong, Junting and Fang, Qi and Yang, Tianshuo and Shuai, Qing and Qiao, Chengyu and Peng, Sida},
    title     = {iVS-Net: Learning Human View Synthesis from Internet Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22942-22951}
}