Fusing Convolution and Vision Transformer Encoders for Object Height Estimation from Monocular Satellite and Aerial Images
Abstract
Accurate height estimation from aerial and satellite imagery is crucial for large-scale 3D scene modeling, with applications in urban planning, environmental monitoring, and disaster management. In this work, we propose integrating convolutional neural networks (CNNs) and vision transformers (ViTs) to leverage both local and global feature extraction. Our experiments show that combining CNN and ViT encoders significantly improves accuracy over either alone: CNNs capture fine local detail, while ViTs provide global contextual understanding. We additionally incorporate a segmentation head to improve pixel-level precision, particularly at object boundaries. Evaluated on the DFC2019 and DFC2023 datasets, the proposed fusion approach outperforms baseline methods across multiple metrics; for instance, root-mean-square error (RMSE) is reduced by 5%-13% and delta-threshold accuracy improves by 4%-9%. The results also demonstrate strong generalization across diverse sensors, acquisition altitudes, viewing angles, and real-world scenarios. Our models are released at https://github.com/Furkangultekin/FusedHE.
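To illustrate the fusion idea described in the abstract, the sketch below shows one plausible way to combine a CNN encoder and a ViT encoder with a shared decoder feeding a height-regression head and an auxiliary segmentation head. This is a minimal, hypothetical PyTorch sketch, not the authors' released FusedHE code: the class names, feature dimensions, and the 1x1-convolution fusion are illustrative assumptions; consult the GitHub repository for the actual architecture.

```python
import torch
import torch.nn as nn

class TinyCNNEncoder(nn.Module):
    """Small convolutional encoder: captures local detail (illustrative stand-in for a CNN backbone)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_dim, 3, stride=2, padding=1), nn.ReLU(),
        )  # downsamples H and W by 8

    def forward(self, x):
        return self.net(x)  # (B, out_dim, H/8, W/8)

class TinyViTEncoder(nn.Module):
    """Patch embedding + transformer encoder: captures global context."""
    def __init__(self, img_size=256, patch=8, dim=128, depth=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.grid = img_size // patch
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        t = self.encoder(t + self.pos)
        # reshape tokens back into a spatial feature map matching the CNN output
        return t.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)

class FusedHeightNet(nn.Module):
    """Concatenate CNN and ViT feature maps, then decode to a height map and a segmentation map.
    Hypothetical sketch of the fusion idea; not the released FusedHE model."""
    def __init__(self, dim=128, n_classes=2):
        super().__init__()
        self.cnn = TinyCNNEncoder(dim)
        self.vit = TinyViTEncoder(dim=dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)  # simple 1x1-conv fusion (an assumption)
        self.decoder = nn.Sequential(           # upsample x8 back to input resolution
            nn.Upsample(scale_factor=8, mode='bilinear', align_corners=False),
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(),
        )
        self.height_head = nn.Conv2d(64, 1, 1)        # per-pixel height regression
        self.seg_head = nn.Conv2d(64, n_classes, 1)   # auxiliary segmentation head

    def forward(self, x):
        f = self.fuse(torch.cat([self.cnn(x), self.vit(x)], dim=1))
        d = self.decoder(f)
        return self.height_head(d), self.seg_head(d)

model = FusedHeightNet()
img = torch.randn(1, 3, 256, 256)   # dummy RGB aerial tile
height, seg = model(img)
print(height.shape, seg.shape)      # torch.Size([1, 1, 256, 256]) torch.Size([1, 2, 256, 256])
```

In this sketch the two encoders produce feature maps at the same spatial resolution so they can be concatenated channel-wise; the auxiliary segmentation head shares the decoder, which is one common way to sharpen predictions at object boundaries.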
Related Material

[pdf]  [supp]

[bibtex]
@InProceedings{Gultekin_2025_ICCV,
  author    = {G\"ultekin, Furkan and Koz, Alper and Bahmanyar, Reza and Azimi, Seyed M. and S\"uzen, Mehmet L\"utfi},
  title     = {Fusing Convolution and Vision Transformer Encoders for Object Height Estimation from Monocular Satellite and Aerial Images},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {3709-3718}
}