Are large vision-language models (VLMs) truly effective? In this work, we show that popular VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. Unlike previous VQA benchmarks such as MME that can be solved by blind QA models, NaturalBench avoids such shortcuts by pairing each question with two images that yield different answers. Using a surprisingly simple procedure built on foundation models like CLIP and ChatGPT, we collect adversarial VQA samples from natural image-text corpora and assemble NaturalBench, a benchmark of over 10,000 human-verified VQA samples for reliably evaluating VLMs. We note several interesting findings:
NaturalBench examples consist of two questions and two images with alternating answers to prevent ``blind'' models from scoring well (e.g., those that predict the same answer regardless of the image or question, as discussed in the paper). We compare the ground-truth answer for each (image, question) pair with predictions from leading VLMs including GPT-4o, GPT-4V, BLIP3, LLaVA-NeXT, InstructBLIP, mPLUG-Owl2.1, Qwen-VL, and InternVL. Even the best model, GPT-4o, lags far behind human performance (above 90%). Moreover, these samples are surprisingly easy to collect, as we describe next.
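To make the paired format concrete, here is a minimal sketch of one test sample and its per-pair evaluation; the dict layout, field names, and the `vlm_answer` call are illustrative placeholders rather than the official data schema or API.

```python
# A minimal sketch of a NaturalBench-style test sample and its per-pair
# evaluation. Field names and the example content are placeholders.
sample = {
    "images":    ["image_1.jpg", "image_2.jpg"],
    "questions": ["Is the dog wearing a collar?", "Is the dog indoors?"],
    # answers[i][j] is the ground truth for (question i, image j); the
    # answers alternate so a blind model cannot get all four pairs right.
    "answers":   [["Yes", "No"], ["No", "Yes"]],
}

def evaluate_sample(vlm_answer, sample):
    """Return a 2x2 matrix: correct[i][j] is True iff the model answers
    question i correctly on image j. `vlm_answer(image, question)` is a
    placeholder for any VLM inference call."""
    return [
        [
            vlm_answer(image, question).strip().lower() == truth.lower()
            for image, truth in zip(sample["images"], truths)
        ]
        for question, truths in zip(sample["questions"], sample["answers"])
    ]
```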
We use a semi-automated procedure to collect NaturalBench from natural image-text corpora like Flickr30K and DOCCI. First, we identify confounding pairs of image-text samples that fail discriminative VLMs like CLIP, e.g., they wrongly match an image with another image's caption. Next, we prompt ChatGPT (or GPT-4-Vision) to design questions that yield different answers for each image, providing the original captions (or images) in the prompt. Finally, we hire human annotators to filter out incorrect or irrelevant VQA samples. This process is much simpler than that of classic adversarial benchmarks, as we neither target specific VQA models nor perturb the images or questions.
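The first step can be sketched as follows, assuming a Hugging Face CLIP checkpoint and a tiny placeholder corpus (the file names and captions stand in for Flickr30K or DOCCI entries); this is our reading of the procedure, not the authors' exact code.

```python
# Sketch of the first curation step: use a discriminative VLM (CLIP) to find
# "confounding" image-text pairs, i.e., pairs where CLIP scores an image
# higher with the *other* image's caption than with its own.
from itertools import combinations
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]   # placeholder image files
captions    = ["caption 0", "caption 1", "caption 2"]   # their paired captions

images = [Image.open(path) for path in image_paths]
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sim = model(**inputs).logits_per_image  # (num_images, num_captions)

confounding_pairs = [
    (i, j)
    for i, j in combinations(range(len(images)), 2)
    # CLIP prefers the other image's caption for at least one of the two images
    if sim[i, j] > sim[i, i] or sim[j, i] > sim[j, j]
]
# Each confounding pair is then sent to ChatGPT / GPT-4-Vision with a prompt
# asking for questions whose answers differ between the two images, followed
# by human verification.
```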
To show that previous benchmarks can be gamed by blind solutions, we use a random half of each benchmark for finetuning and test on the other half. Finetuning a blind LLM (GPT-3.5) on QA data alone (without images) significantly outperforms random chance and sometimes even matches LLaVA-1.5 finetuned with images (more benchmarks are reported in the paper). In contrast, NaturalBench enforces a balanced answer distribution for each question and image, ensuring that blind solutions achieve only chance-level performance. To better understand model performance, we also introduce three additional metrics. The "question accuracy" (Q-Acc) metric awards a point only if a model correctly answers a question for both images. Similarly, the "image accuracy" (I-Acc) metric awards a point only when a model correctly answers both questions for an image. Lastly, the "group accuracy" (G-Acc) metric awards a point only when a model correctly answers all four (image, question) pairs in a test sample.
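The sketch below computes these metrics for a single test sample from the 2x2 correctness matrix produced by the earlier `evaluate_sample` sketch; averaging them over all test samples into benchmark-level numbers is our assumption of the standard setup.

```python
# A minimal sketch of the metrics above, computed from the 2x2 correctness
# matrix of one test sample (correct[i][j] = model answered question i
# correctly on image j).
def sample_metrics(correct):
    acc   = sum(sum(row) for row in correct) / 4        # plain per-pair accuracy
    q_acc = sum(all(row) for row in correct) / 2        # question right on both images
    i_acc = sum(all(col) for col in zip(*correct)) / 2  # image right on both questions
    g_acc = float(all(all(row) for row in correct))     # all four pairs right
    return {"Acc": acc, "Q-Acc": q_acc, "I-Acc": i_acc, "G-Acc": g_acc}

# A blind model that always answers "Yes" gets Acc = 0.5 on a balanced
# sample, but Q-Acc = I-Acc = G-Acc = 0:
print(sample_metrics([[True, False], [False, True]]))
```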
We report the performance of 37 leading VLMs on NaturalBench. All models significantly lag behind human performance, with the performance gap (in G-Acc) between humans and models highlighted in red. Interestingly, VLMs with larger language models do not always perform better. Even the best closed-source GPT-4o is still significantly behind humans.
Biases: NaturalBench exposes VLMs' bias towards certain answers, such as "Yes", regardless of the input image and question. We use answer likelihoods (VQAScore) to perform a scoring-based evaluation, comparing the likelihood of the correct (image, question, answer) triple against the incorrect one, and show that proper debiasing can lead to large performance gains.
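As an illustration, the sketch below shows what such a scoring-based evaluation and a simple debiasing correction might look like; the `answer_score` helper and the per-answer bias term are assumptions for this sketch, not the paper's exact recipe or an official API.

```python
# Sketch of a scoring-based (likelihood-based) evaluation with a simple
# debiasing correction. answer_score(image, question, answer) is assumed to
# return the model's (log-)likelihood of generating `answer`.
def pick_answer(image, question, candidates, answer_score):
    """Score each candidate answer and return the most likely one, instead of
    trusting free-form generation (which is often biased towards 'Yes')."""
    return max(candidates, key=lambda a: answer_score(image, question, a))

def debiased_pick(image, question, candidates, answer_score, answer_bias):
    """An illustrative debiasing variant: subtract a per-answer bias term,
    e.g., the answer's average score over the benchmark, before comparing
    candidates, so a globally favored answer like 'Yes' loses its head start."""
    return max(candidates,
               key=lambda a: answer_score(image, question, a) - answer_bias[a])
```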
Since benchmarks often leak into foundation models' training data, it is crucial to update benchmarks using new data sources. Our benchmark curation method can easily adapt to new image-text datasets. We expand NaturalBench by incorporating two recently proposed datasets: (1) DOCCI with fine-grained captions over 100 words, and (2) XM3600 with captions in Chinese and Hindi. We hope our efforts will inspire future work in studying dynamic evaluations of VLMs.
@inproceedings{li2024natural,
  title={NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples},
  author={Li, Baiqi and Lin, Zhiqiu and Peng, Wenxuan and Nyandwi, Jean de Dieu and Jiang, Daniel and Khanuja, Simran and Ma, Zixian and Krishna, Ranjay and Neubig, Graham and Ramanan, Deva},
  booktitle={TODO},
  year={2024}
}