Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models

Samar Fares, Karthik Nandakumar; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 24726-24735

Abstract


Poisoning (trojan/backdoor) attacks enable an adversary to train and deploy a corrupted machine learning (ML) model that achieves good accuracy on clean input samples but behaves maliciously on poisoned samples containing specific trigger patterns. Using such poisoned ML models as the foundation for real-world systems can compromise application safety. Hence, there is a critical need for algorithms that detect whether a given target model has been poisoned. This work proposes a novel approach for detecting poisoned models, called Attack To Defend (A2D), which is based on the observation that poisoned models are more sensitive to adversarial perturbations than benign models. We propose a metric called sensitivity to adversarial perturbations (SAP) to measure the sensitivity of an ML model to adversarial attacks at a specific perturbation bound. We then generate strong adversarial attacks against an unrelated reference model and estimate the SAP value of the target model by transferring the generated attacks. The target model is deemed to be a trojan if its SAP value exceeds a decision threshold. The A2D framework requires only black-box access to the target model and a small clean set, while being computationally efficient. The A2D approach has been evaluated on four standard image datasets, and its effectiveness against various types of poisoning attacks has been demonstrated.
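
For intuition, the PyTorch sketch below shows one plausible way to instantiate the A2D pipeline described above. The concrete SAP definition used here (the target model's accuracy drop on adversarial examples transferred from the reference model at perturbation bound eps), the helper names pgd_attack, sap_score, and is_poisoned, and the default threshold are illustrative assumptions rather than the paper's implementation; the target model is only queried through forward passes, consistent with black-box access.

# Illustrative sketch of an A2D-style detector (assumptions noted above).
import torch
import torch.nn.functional as F


def pgd_attack(reference_model, x, y, eps, steps=10):
    """L-infinity PGD crafted against the white-box reference model."""
    alpha = 2.5 * eps / steps
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(reference_model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)              # keep pixels in valid range
    return x_adv.detach()


@torch.no_grad()
def accuracy(model, x, y):
    # Only forward passes on the model, i.e. black-box queries.
    return (model(x).argmax(dim=1) == y).float().mean().item()


def sap_score(target_model, reference_model, clean_x, clean_y, eps):
    """SAP at bound eps, taken here as the target model's accuracy drop
    when transferred adversarial examples replace the clean inputs."""
    x_adv = pgd_attack(reference_model, clean_x, clean_y, eps)
    return accuracy(target_model, clean_x, clean_y) - accuracy(target_model, x_adv, clean_y)


def is_poisoned(target_model, reference_model, clean_x, clean_y, eps=8 / 255, threshold=0.5):
    """Deem the target model a trojan if its SAP exceeds the decision threshold."""
    return sap_score(target_model, reference_model, clean_x, clean_y, eps) > threshold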

Related Material


[bibtex]
@InProceedings{Fares_2024_CVPR,
    author    = {Fares, Samar and Nandakumar, Karthik},
    title     = {Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {24726-24735}
}