Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
Published in ACL 2025
We introduce MAC (Multimodal Adversarial Compositionality), a benchmark for evaluating the robustness of pre-trained multimodal models against adversarial text updates. Our study shows that state-of-the-art vision-language models such as CLIP are vulnerable to compositional adversarial attacks generated by LLMs, highlighting the need for more robust multimodal representations.
| Paper (arXiv) | Paper (ACL Anthology) | Code |
Recommended citation: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim. (2025). "Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates." ACL 2025.
