QR-DETR : Query Routing for Detection Transformer

Tharsan Senthivel, Ngoc-Son Vu; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 354-371

Abstract


Detection Transformer (DETR) predicts object bounding boxes and classes from learned object queries. However, DETR exhibits three major flaws: (1) Only a subset of object queries contributes to the final predictions, leading to inefficient use of computational resources. (2) The self-attention and feed-forward layers indiscriminately mix information across object queries without any guidance, potentially hindering effective learning of object representations. (3) At each layer of the decoder stack, a query can evolve either positively, correctly refining its bounding box and class attributes, or negatively, shifting to predict a different object or erroneously enlarging its bounding box. This suggests that query informativeness is non-uniform and that unrestricted inter-query communication can impede the learning of specialized representations for individual queries. To address these concerns, we propose a learnable query routing method that introduces a routing model to identify the object queries requiring processing at each transformer decoder layer. The selected queries pass through the entire decoder stack, while the others exit early. Subsequently, all queries are scattered back to their original positions after the feed-forward processing of the selected queries. This process prevents indiscriminate information sharing across all queries. Extensive experiments on the COCO dataset demonstrate the effectiveness of our method, showing a consistent increase in mAP across multiple DETR models.
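The gather-process-scatter mechanism described in the abstract can be illustrated with a minimal PyTorch sketch of one decoder layer. This is our own illustrative reconstruction, not the authors' implementation: the `router` module, the fixed top-k `keep_ratio`, and the hard selection are assumptions made here for clarity.

```python
import torch
import torch.nn as nn


class RoutedDecoderLayer(nn.Module):
    """Hypothetical sketch of one DETR decoder layer with query routing.

    A lightweight routing model scores each object query; only the top-k
    queries go through self-attention, cross-attention, and the FFN, while
    the rest bypass the layer. Processed queries are then scattered back
    to their original positions, so no information is mixed across
    unselected queries.
    """

    def __init__(self, d_model=256, nhead=8, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio          # assumed fixed selection ratio
        self.router = nn.Linear(d_model, 1)   # assumed routing model
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        # queries: (B, Q, d) object queries; memory: (B, HW, d) encoder features
        B, Q, d = queries.shape
        k = max(1, int(self.keep_ratio * Q))

        # Score queries and gather the top-k that still need processing.
        scores = self.router(queries).squeeze(-1)            # (B, Q)
        idx = scores.topk(k, dim=1).indices                  # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, d)            # (B, k, d)
        selected = queries.gather(1, idx)                    # (B, k, d)

        # Only the selected queries attend and pass through the FFN.
        x = self.norm1(selected + self.self_attn(selected, selected, selected)[0])
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        x = self.norm3(x + self.ffn(x))

        # Scatter processed queries back; unselected queries exit unchanged.
        return queries.scatter(1, idx, x)
```

Because the unselected queries are returned unchanged at their original positions, the layer's output keeps the same (B, Q, d) shape as standard DETR, so downstream prediction heads and bipartite matching need no modification.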

Related Material


[bibtex]
@InProceedings{Senthivel_2024_ACCV,
    author    = {Senthivel, Tharsan and Vu, Ngoc-Son},
    title     = {QR-DETR : Query Routing for Detection Transformer},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {354-371}
}