Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Published in ACL 2025

We introduce MAC (Multimodal Adversarial Compositionality), a benchmark for evaluating the robustness of pre-trained multimodal models against adversarial text updates. Our study reveals that state-of-the-art vision-language models like CLIP are vulnerable to compositional adversarial attacks generated by LLMs, highlighting the need for more robust multimodal representations.
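To give a concrete sense of the setting, the sketch below shows how one might check whether an LLM-rewritten caption fools CLIP by comparing image-text similarity scores. This is only an illustrative example, not the paper's evaluation pipeline; it assumes the HuggingFace `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`, and the image path and captions are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical example: score an original caption and an LLM-perturbed caption
# against the same image with a pre-trained CLIP model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "a dog playing in the park",        # original caption
    "a dog sleeping in the park",       # perturbed caption with a changed composition
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_captions): higher means CLIP judges the
# caption as a better match for the image.
scores = outputs.logits_per_image.squeeze(0)
print(scores)
# If the semantically altered caption scores as high as (or higher than) the
# original, the text update has "deceived" CLIP in the sense studied here.
```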

Paper (arXiv) | Paper (ACL Anthology) | Code

Recommended citation: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim. (2025). "Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates." ACL 2025.