QAttn: Efficient GPU Kernels for Mixed-precision Vision Transformers

Kluska, Piotr; Castelló, Adrián; Scheidegger, Florian; Malossi, A. Cristiano I.; Quintana-Ortí, Enrique S.

Piotr Kluska, Adrián Castelló, Florian Scheidegger, A. Cristiano I. Malossi, Enrique S. Quintana-Ortí; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 3648-3657

Abstract

Vision Transformers have demonstrated outstanding performance in Computer Vision tasks. Nevertheless, this superior performance for large models comes at the expense of increasing memory usage for storing the parameters and intermediate activations. To accelerate model inference, in this work we develop and evaluate integer and mixed-precision kernels in Triton for the efficient execution of two fundamental building blocks of transformers (linear layer and attention) on graphics processing units (GPUs). On an NVIDIA A100 GPU, our kernel implementations of Vision Transformers achieve a throughput speedup of up to 7x compared with reference kernels in PyTorch floating-point single precision (FP32). Additionally, the top-1 accuracy of the ViT Large model drops by less than one percent on the ImageNet1K classification task. We also observe up to 6x increased throughput by applying our kernels to the Segment Anything Model image encoder while keeping the mIOU close to the FP32 reference on the COCO2017 dataset for static and dynamic quantization. Furthermore, our kernels demonstrate improved speed over the TensorRT INT8 linear layer, and we improve the throughput of base FP16 (half precision) Triton attention on average by up to 19 ± 4.01%. We have open-sourced the QAttn framework, which is tightly integrated with the PyTorch quantization workflow: https://github.com/IBM/qattn.
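As a rough illustration of the static versus dynamic quantization mentioned in the abstract, the following pure-Python sketch emulates a symmetric per-tensor INT8 linear layer. It is not the QAttn implementation and does not use Triton. The function names, the symmetric [-127, 127] scheme, and the calibrated activation range of [-2, 2] are all assumptions made for this example.

```python
# Illustrative sketch only: symmetric per-tensor INT8 quantization of a linear
# layer y = x @ W^T. All names are hypothetical and are not the QAttn API.

def quantize(values, scale):
    """Map floats to integer codes clamped to [-127, 127]."""
    return [[max(-127, min(127, round(v / scale))) for v in row] for row in values]

def dynamic_scale(values):
    """Dynamic quantization: derive the scale from the tensor seen at run time."""
    max_abs = max(abs(v) for row in values for v in row)
    return max_abs / 127 if max_abs > 0 else 1.0

def int8_linear(x, w, x_scale, w_scale):
    """Integer dot products with an integer accumulator, then rescale to float."""
    xq = quantize(x, x_scale)
    wq = quantize(w, w_scale)
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale for w_row in wq]
        for x_row in xq
    ]

def float_linear(x, w):
    """Floating-point reference for comparison."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]

x = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]   # activations (2 tokens, 3 features)
w = [[0.2, -0.4, 0.1], [-0.3, 0.6, 0.9]]     # weights (2 outputs, 3 features)

w_scale = dynamic_scale(w)       # weights are fixed, so this is computed once
static_x_scale = 2.0 / 127       # static: assumed calibrated range of [-2, 2]

y_ref = float_linear(x, w)
y_dynamic = int8_linear(x, w, dynamic_scale(x), w_scale)   # scale from live data
y_static = int8_linear(x, w, static_x_scale, w_scale)      # scale fixed offline
```

In this sketch the only difference between the two modes is where the activation scale comes from: static quantization fixes it ahead of time from calibration data, while dynamic quantization recomputes it from each input, which adds a reduction over the tensor at run time.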

Related Material

[bibtex]

@InProceedings{Kluska_2024_CVPR,
    author    = {Kluska, Piotr and Castell\'o, Adri\'an and Scheidegger, Florian and Malossi, A. Cristiano I. and Quintana-Ort{\'\i}, Enrique S.},
    title     = {QAttn: Efficient GPU Kernels for Mixed-precision Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {3648-3657}
}