MFT-VITON: High-Fidelity Virtual Try-On with Minimal Input via a Mask-Free Transformer-Diffusion Model

Wan, Zhenchen; Xu, Yanwu; Hu, Dongting; Cheng, Weilun; Chen, Tianxi; Wang, Zhaoqing; Liu, Feng; Liu, Tongliang; Gong, Mingming

Zhenchen Wan, Yanwu Xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 2006-2015

Abstract

Recent advancements in Virtual Try-On (VITON) have achieved remarkable realism and texture fidelity, largely attributed to the emergence of text-to-image (T2I) diffusion models. However, prevailing U-Net-based T2I backbones are increasingly inadequate for rendering fine-grained garment details, particularly in preserving textual elements and subtle textures. Diffusion Transformer (DiT)-based architectures, with their superior generative capacity, offer a promising alternative, yet their integration into current VITON pipelines is impeded by substantial architectural mismatches. To address these challenges, we propose a novel mask-based framework augmented with three key components: a Garment Semantic (GS)-Adapter for enhanced garment-specific representation, a Text Preservation Loss to maintain high-fidelity text rendering, and LLM-driven semantic guidance for improved alignment between textual prompts and visual outputs. While effective, the mask-based approach still relies heavily on user-provided masks, introducing complexity and potential inaccuracies. To mitigate these issues, we further develop a Mask-Free strategy by leveraging a synthesized dataset to adapt our model and other mask-dependent baselines for mask-independent garment transfer. This approach eliminates the need for explicit masks while preserving garment shape and texture. Experimental results demonstrate that our Mask-Free model consistently outperforms state-of-the-art mask-based methods, establishing a new benchmark in both visual fidelity and usability.

Related Material

[bibtex]

@InProceedings{Wan_2025_ICCV,
    author    = {Wan, Zhenchen and Xu, Yanwu and Hu, Dongting and Cheng, Weilun and Chen, Tianxi and Wang, Zhaoqing and Liu, Feng and Liu, Tongliang and Gong, Mingming},
    title     = {MFT-VITON: High-Fidelity Virtual Try-On with Minimal Input via a Mask-Free Transformer-Diffusion Model},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {2006-2015}
}