ViT-YOLO: Transformer-Based YOLO for Object Detection

Zixiao Zhang, Xiaoqiang Lu, Guojin Cao, Yuting Yang, Licheng Jiao, Fang Liu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 2799-2808


Drone-captured images have overwhelming characteristics, including dramatic scale variance, complicated backgrounds filled with distractors, and flexible viewpoints, which pose enormous challenges for general object detectors based on common convolutional networks. Recently, the design of vision backbone architectures that use self-attention has become an exciting topic. In this work, an improved backbone, MHSA-Darknet, is designed to retain sufficient global context information and extract more differentiated features for object detection via multi-head self-attention. For the path-aggregation neck, we present a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN) for effective cross-scale feature fusion. In addition, other techniques, including test-time augmentation (TTA) and weighted boxes fusion (WBF), help to achieve better accuracy and robustness. Our experiments demonstrate that ViT-YOLO significantly outperforms the state-of-the-art detectors and achieves one of the top results in the VisDrone-DET2021 challenge (39.41 mAP for the test-challenge data set and 41 mAP for the test-dev data set).
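To make the idea of weighted cross-scale fusion concrete, the following is a minimal sketch in plain Python. It assumes the "fast normalized fusion" rule commonly associated with BiFPN, in which each input gets a non-negative learnable weight and the weights are normalized by their sum; the abstract does not state which fusion rule ViT-YOLO uses, so this is an illustration under that assumption rather than the authors' implementation. The function name `fast_normalized_fusion`, the `eps` value, and the sample vectors are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors using normalized, non-negative weights.

    Assumed rule (not confirmed by the abstract):
        out = sum(w_i * x_i) / (eps + sum(w_i)), with w_i = max(0, w_i)
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    if any(len(f) != len(features[0]) for f in features):
        raise ValueError("all feature vectors must have the same length")

    # Clip negative weights to zero (a ReLU) so every contribution is non-negative.
    clipped = [max(0.0, w) for w in weights]
    # eps keeps the division stable if all weights are clipped to zero.
    total = sum(clipped) + eps

    fused = []
    for i in range(len(features[0])):
        weighted_sum = sum(w * f[i] for w, f in zip(clipped, features))
        fused.append(weighted_sum / total)
    return fused


# Hypothetical flattened feature maps, already resized to the same resolution.
top_down = [1.0, 2.0, 3.0]
lateral = [3.0, 2.0, 1.0]
fused = fast_normalized_fusion([top_down, lateral], weights=[1.0, 3.0])
```

In a real detector the inputs would be multi-channel tensors and the weights would be trained with the rest of the network; the scalar lists here only show how the normalization lets the more heavily weighted scale dominate the fused output.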

Related Material

@InProceedings{Zhang_2021_ICCV,
    author    = {Zhang, Zixiao and Lu, Xiaoqiang and Cao, Guojin and Yang, Yuting and Jiao, Licheng and Liu, Fang},
    title     = {ViT-YOLO:Transformer-Based YOLO for Object Detection},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2021},
    pages     = {2799-2808}
}