@InProceedings{Jiang_2025_ICCV,
    author    = {Jiang, Zhongyu and Cai, Jiarui and Liu, Chang and An, Dongsheng and Wu, Jonathan},
    title     = {SynBalance: Harnessing Synthetic Data in Long-tailed Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {6335-6344}
}
SynBalance: Harnessing Synthetic Data in Long-tailed Recognition
Abstract
Real-world datasets frequently exhibit long-tailed distributions, where a small number of classes dominate while the majority are underrepresented. Traditional methods, such as data resampling, loss weighting, and uniform augmentation, provide limited solutions as they fail to address the unique requirements of both head and tail classes effectively. This paper presents a novel approach to tackling the long-tail problem by integrating strategically selected synthetic data, generated with state-of-the-art models like Stable Diffusion and DALL-E, to complement the original long-tailed dataset. We examine the impact of key synthetic data properties, including quantity, diversity, recognizability, and domain gap, through extensive experiments, uncovering insights into their interrelationships. Based on these findings, we propose SynBalance, a comprehensive pipeline that includes: (1) manipulating the data synthesis process to generate synthetic data with desirable properties, (2) synthetic data composition, which selects a complementary synthetic subset tailored to the imbalanced real-world distribution, and (3) SynFusion, a customized training framework designed to integrate the original dataset with the selected synthetic data for improved performance. Extensive evaluations on three widely used benchmarks demonstrate significant performance improvements, providing valuable insights into effective dataset composition and learning strategies for long-tailed distributions.
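The idea of selecting a complementary synthetic subset for an imbalanced dataset can be illustrated with a minimal sketch. This is not the SynBalance procedure, which the abstract does not specify; it assumes, purely for illustration, that each class is filled up to the head-class size and that every synthetic sample carries a single hypothetical quality score (for example, a recognizability estimate). The function names `synthetic_quota` and `select_synthetic` are invented for this example.

```python
from collections import Counter


def synthetic_quota(real_labels, target=None):
    """Return, per class, how many synthetic samples are needed to reach `target`.

    Assumption (not from the paper): by default every class is filled up to
    the size of the largest (head) class.
    """
    counts = Counter(real_labels)
    if target is None:
        target = max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}


def select_synthetic(pool, quota):
    """Pick the highest-scoring synthetic samples per class, up to the quota.

    `pool` is a list of (class_name, score) tuples; the score is a
    hypothetical per-sample quality measure.
    """
    selected = []
    for cls, need in quota.items():
        candidates = sorted(
            (sample for sample in pool if sample[0] == cls),
            key=lambda sample: sample[1],
            reverse=True,
        )
        # Slicing simply returns fewer items when the pool is too small.
        selected.extend(candidates[:need])
    return selected


# Toy long-tailed dataset: one head class and two tail classes.
real_labels = ["cat"] * 5 + ["dog"] * 2 + ["owl"] * 1
quota = synthetic_quota(real_labels)

# Toy synthetic pool with made-up scores.
pool = [
    ("dog", 0.9), ("dog", 0.4), ("dog", 0.7), ("dog", 0.8),
    ("owl", 0.6), ("owl", 0.95), ("cat", 0.99),
]
chosen = select_synthetic(pool, quota)
```

In this toy setup the head class receives no synthetic samples, while the tail classes receive as many as the pool can supply, up to their quota. A real pipeline would also need to weigh diversity and domain gap, which a single score does not capture.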