Lite-MDETR: A Lightweight Multi-Modal Detector

Qian Lou, Yen-Chang Hsu, Burak Uzkent, Ting Hua, Yilin Shen, Hongxia Jin; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12206-12215


Recent multi-modal detectors based on transformers and modality encoders have successfully achieved impressive results on end-to-end visual object detection conditioned on a raw text query. However, they require a large model size and an enormous amount of computations to achieve high performance, which makes it difficult to deploy mobile applications that are limited by tight hardware resources. In this paper, we present a Lightweight modulated detector, Lite-MDETR, to facilitate efficient end-to-end multi-modal understanding on mobile devices. The key primitive is that Dictionary-Lookup-Transformormations (DLT) is proposed to replace Linear Transformation (LT) in multi-modal detectors where each weight in Linear Transformation (LT) is approximately factorized into a smaller dictionary, index, and coefficient. This way, the enormous linear projection with weights is converted into lite linear projection with dictionaries, a few lookups and scalings with indices and coefficients. DLT can be directly applied to pre-trained detectors, removing the need to perform expensive training from scratch. To tackle the challenging training of DLT due to the non-differentiable index, we convert the index and coefficient into a sparse matrix, train this sparse matrix during the fine-tuning phase, and recover it back to index and coefficient during the inference phase. Extensive experiments on several tasks such as phrase grounding, referring expression comprehension and segmentation show that our Lite-MDETR achieves similar detection accuracy to the prior multi-modal detectors with ~ 4.1xmodel size reduction.

Related Material

@InProceedings{Lou_2022_CVPR, author = {Lou, Qian and Hsu, Yen-Chang and Uzkent, Burak and Hua, Ting and Shen, Yilin and Jin, Hongxia}, title = {Lite-MDETR: A Lightweight Multi-Modal Detector}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {12206-12215} }