Flow Matching for Multimodal Distributions

Gaoxiang Luo, Frank Cole, Sihang Zhang, Yuxiang Wan, Yulong Lu, Ju Sun; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 23260-23271

Abstract


Recently, vision foundation models have been shown to boost the efficiency of flow-based generative models by revealing the intrinsic union-of-manifold structures and lowering the complexity of the latent/target distribution. In this paper, we exploit the multimodality of the union-of-manifold structures, and aim to further improve the learning and inference efficiency of flow-matching models. To this end, we propose an efficient source-and-coupling co-design method termed Mixture-Modeling Flow Matching (MM-FM), which integrates a data-adaptive multimodal source distribution (implemented as Gaussian mixture models) with mode-dependent data coupling. The former shortens the distance between the source and the target, and the latter promotes local and straighter flows. We also derive theoretical results that confirm our intuition in a quantitative sense. In our experiments on ImageNet 256x256 with multimodal DINOv2-B latents, MM-FM exhibits superior learning efficiency and state-of-the-art unconditional generation quality: FID=2.74 with autoguidance in only 80 epochs.
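The source and coupling idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes 2-D synthetic data, uses hard cluster labels with per-mode diagonal Gaussians as a stand-in for a fitted Gaussian mixture model, and uses the standard linear interpolant for flow matching. All function names (`fit_diag_gaussians`, `mode_coupled_pairs`, `flow_matching_targets`) are hypothetical.

```python
import numpy as np

def fit_diag_gaussians(x, labels, k):
    """Per-mode mean and diagonal std from hard labels (assumed stand-in for a GMM fit)."""
    means = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    stds = np.stack([x[labels == j].std(axis=0) + 1e-6 for j in range(k)])
    return means, stds

def mode_coupled_pairs(x1, labels, means, stds, rng):
    """Draw each source point from the mixture component of its target's mode."""
    eps = rng.standard_normal(x1.shape)
    return means[labels] + stds[labels] * eps

def flow_matching_targets(x0, x1, t):
    """Linear interpolant x_t = (1 - t) x0 + t x1 and its velocity target x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

rng = np.random.default_rng(0)
k, n, d = 3, 600, 2

# Toy multimodal target: three well-separated clusters (assumed data).
centers = np.array([[6.0, 0.0], [-6.0, 0.0], [0.0, 6.0]])
labels = rng.integers(0, k, size=n)
x1 = centers[labels] + 0.5 * rng.standard_normal((n, d))

# Data-adaptive multimodal source with mode-dependent coupling.
means, stds = fit_diag_gaussians(x1, labels, k)
x0_mix = mode_coupled_pairs(x1, labels, means, stds, rng)

# Baseline: a single standard Gaussian source with independent coupling.
x0_std = rng.standard_normal((n, d))

# Regression inputs and targets a velocity network would be trained on.
t = rng.uniform(size=(n, 1))
xt, v_target = flow_matching_targets(x0_mix, x1, t)

# Mean squared source-to-target distance under each choice of source.
dist_mix = np.mean(np.sum((x1 - x0_mix) ** 2, axis=1))
dist_std = np.mean(np.sum((x1 - x0_std) ** 2, axis=1))
print(f"mode-coupled mixture source: {dist_mix:.2f}")
print(f"standard Gaussian source:    {dist_std:.2f}")
```

On this toy data the mode-coupled pairs are much closer to their targets than pairs drawn from a single standard Gaussian, which is the intuition behind shorter and more local flows. How the paper fits the mixture, assigns modes in DINOv2-B latent space, and trains the velocity field is not specified in the abstract and is not reproduced here.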

Related Material


@InProceedings{Luo_2026_CVPR,
    author    = {Luo, Gaoxiang and Cole, Frank and Zhang, Sihang and Wan, Yuxiang and Lu, Yulong and Sun, Ju},
    title     = {Flow Matching for Multimodal Distributions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {23260-23271}
}