UP-VTON: A Unified Virtual Try-On Framework Supporting Mask, Mask-Free, and Prompt-Driven Guidance

Youngjoo Jo, Minho Park, Dong-oh Kang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 7030-7038

Abstract


Image-based virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment. While recent advances in image generation have improved visual quality, existing methods are typically categorized as either mask-based or mask-free. Mask-based approaches rely on clothing masks to localize garment regions but often cause artifacts and identity distortion. Mask-free methods eliminate this dependency but can suffer from hallucinations and poor garment-person alignment. We argue that users should be able to control the use and extent of garment masks, as rigid assumptions hinder flexibility and fine-grained editing. Moreover, many prior works require additional modalities (such as keypoints or DensePose), which complicate the pipeline and increase annotation costs. To overcome these limitations, we propose UP-VTON, a unified virtual try-on framework that performs robustly with or without garment masks and supports prompt-based controllable generation. Our approach introduces triptych prompting, a hybrid inpainting strategy guided by reference images, text prompts, and visual cues. Without masks, the model generates from scratch using full-image masking while allowing flexible region control to reflect user intent. We also construct a diverse dataset without requiring segmentation or pose annotations and employ prompts from a large multimodal model to guide garment fit and style. Experimental results demonstrate that UP-VTON outperforms existing methods in flexibility, controllability, and visual realism, enabling high-fidelity and modality-free try-on synthesis.

Related Material


@InProceedings{Jo_2025_ICCV,
    author    = {Jo, Youngjoo and Park, Minho and Kang, Dong-oh},
    title     = {UP-VTON: A Unified Virtual Try-On Framework Supporting Mask, Mask-Free, and Prompt-Driven Guidance},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7030-7038}
}