Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

Nicolas Drapier, Aladine Chetouani, Aurélien Chateigner; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 2937-2946

Abstract


We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Drapier_2025_ICCV, author = {Drapier, Nicolas and Chetouani, Aladine and Chateigner, Aur\'elien}, title = {Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, month = {October}, year = {2025}, pages = {2937-2946} }