AAFormer: A Multi-Modal Transformer Network for Aerial Agricultural Images
Semantic segmentation of agricultural aerial images is important for recognizing and analyzing farmland anomaly patterns such as drydown, endrow, and nutrient deficiency. General semantic segmentation algorithms such as Fully Convolutional Networks can extract rich semantic features, but they struggle to exploit long-range visual information. Recently, vision Transformer architectures have achieved outstanding performance in image segmentation tasks, but they have not been fully explored in the field of agriculture. We therefore propose a novel architecture called Agricultural Aerial Transformer (AAFormer) for the semantic segmentation of aerial farmland images. We adopt Mix Transformer (MiT) in the encoder stage to enhance the recognition of field anomaly patterns, and we use the Squeeze-and-Excitation (SE) module in the decoder stage to improve the effectiveness of key channels. Farmland boundary maps are also introduced into the decoder. Evaluated on the Agriculture-Vision validation set, our proposed model reaches an mIoU of 45.44%.
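For readers unfamiliar with the reported metric, the following is a minimal sketch of the standard mean Intersection-over-Union (mIoU) for single-label segmentation. It is an illustration only: the function name `mean_iou`, the flattened-label input format, and the toy data are assumptions made here, and the evaluation protocol actually used for Agriculture-Vision may differ from this plain definition (for example, in how overlapping labels are handled).

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Standard mean IoU over flattened per-pixel class labels (illustrative)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where prediction and label agree
    pred_count = [0] * num_classes    # pixels predicted as each class
    target_count = [0] * num_classes  # pixels labelled as each class

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.4f}")
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Classes that appear in neither the prediction nor the label are skipped rather than counted as zero, which is one common convention among several.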