TeCM-CLIP: Text-based Controllable Multi-attribute Face Image Manipulation

Xudong Lou, Yiguang Liu, Xuwei Li; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 1942-1958


In recent years, various studies have demonstrated that the prior information of StyleGAN can be used to effectively manipulate and generate realistic images. However, the latent code of StyleGAN is designed to control global styles, which makes it difficult to manipulate a specific attribute precisely and achieve fine-grained control over synthesized images. In this work, we leverage the recently proposed Contrastive Language-Image Pretraining (CLIP) model to manipulate the latent code with text and thereby control image generation. We encode image and text prompts in a shared embedding space, leveraging the powerful image-text representation capabilities learned through contrastive language-image pretraining to manipulate partial style codes in the latent code. For multiple fine-grained attribute manipulations, we propose multiple attribute manipulation frameworks. Compared with previous CLIP-driven methods, our method can perform high-quality attribute editing much faster with less coupling between attributes. Extensive experiments illustrate the effectiveness of our approach.
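The two ideas named in the abstract, matching image and text in a shared embedding space and editing only part of the style codes, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the 18 x 512 latent shape follows the common StyleGAN2 W+ layout at 1024 x 1024 resolution, while the function names, the chosen layer indices, and the random offset are hypothetical stand-ins used only for illustration. A real pipeline would obtain the embeddings from CLIP and the offset from a learned mapping.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two embedding vectors.

    CLIP-driven editing methods commonly use a loss of this form to pull an
    image embedding toward a text embedding in the shared space.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)


def edit_partial_styles(w_plus, delta, layers, strength=1.0):
    """Add an offset to the selected style layers only.

    w_plus : (num_layers, dim) latent code, one style vector per layer.
    delta  : (num_layers, dim) proposed offset (hypothetical; random below).
    layers : indices of the style layers that are allowed to change.
    """
    edited = w_plus.copy()
    edited[layers] += strength * delta[layers]
    return edited


# Toy data standing in for a real latent code and a predicted offset.
rng = np.random.default_rng(0)
w_plus = rng.standard_normal((18, 512))
delta = rng.standard_normal((18, 512))

# Assumption: a handful of middle layers control the target attribute.
layers = [4, 5, 6, 7]
edited = edit_partial_styles(w_plus, delta, layers, strength=0.5)
```

Restricting the offset to a subset of layers leaves the remaining style vectors untouched, which is one plausible way to reduce coupling between attributes.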

Related Material

@InProceedings{Lou_2022_ACCV,
    author    = {Lou, Xudong and Liu, Yiguang and Li, Xuwei},
    title     = {TeCM-CLIP: Text-based Controllable Multi-attribute Face Image Manipulation},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {1942-1958}
}