@InProceedings{Drapier_2025_ICCV,
    author    = {Drapier, Nicolas and Chetouani, Aladine and Chateigner, Aur\'elien},
    title     = {Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {2937-2946}
}
Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery
Abstract
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
