NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

¹Carnegie Mellon University, ²University of Washington
*Co-first authors, *Co-senior authors

Abstract

Are large vision-language models (VLMs) truly effective? In this work, we show that popular VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. Unlike previous VQA benchmarks such as MME, which can be solved by "blind" QA models that ignore the image, NaturalBench avoids such shortcuts by pairing each question with two images that yield different answers. Using a surprisingly simple procedure, we collect adversarial VQA samples from natural image-text corpora with foundation models like CLIP and ChatGPT, resulting in NaturalBench, a benchmark of over 10,000 human-verified VQA samples for reliably evaluating VLMs. We note several interesting findings:

  1. NaturalBench is hard. Popular VLMs like InstructBLIP, LLaVA-NeXT, ShareGPT4V, and XGen-MM (BLIP-3) perform only 1%-15% above random chance. Even the best (closed-source) model, GPT-4o, lags significantly behind human performance (which is above 90%).
  2. NaturalBench is compositional. Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning such as logic and counting. Unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation.
  3. NaturalBench exposes significant biases in VLMs. Most VLMs choose the same answer regardless of the input image (or question). We show that debiasing can be crucial for VLM performance.

Natural Adversarial Samples for Vision-Language Models

Image illustrating NaturalBench

NaturalBench examples consist of two questions and two images with alternating answers to prevent "blind" models from scoring well (e.g., models that predict the same answer regardless of the image or question, as discussed in the paper). We compare the ground-truth answer for each (image, question) pair with predictions from leading VLMs including GPT-4o, GPT-4V, BLIP-3, LLaVA-NeXT, InstructBLIP, mPLUG-Owl2.1, Qwen-VL, and InternVL. Even the best models like GPT-4o lag far behind human performance (which is above 90%). Moreover, these samples are surprisingly easy to collect:

Image illustrating NaturalBench

We use a semi-automated procedure to collect NaturalBench from natural image-text corpora like Flickr30K and DOCCI. First, we identify confounding pairs of image-text samples that fool discriminative VLMs like CLIP, e.g., CLIP wrongly matches an image with another image's caption. Next, we prompt ChatGPT (or GPT-4-Vision) to design questions that yield different answers for each image, providing the original captions (or images) in the prompt. Finally, we hire human annotators to filter out incorrect or irrelevant VQA samples. This procedure is much simpler than that of classic adversarial benchmarks, as we neither target specific VQA models nor perturb the images or questions.
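
As an illustration of the first stage, the sketch below flags a confounding pair with an off-the-shelf CLIP model from Hugging Face Transformers. This is a simplified rendering of the idea rather than the exact NaturalBench collection pipeline, and the checkpoint and helper function names are illustrative choices.

    # Flag a "confounding" image-text pair that fools a discriminative VLM:
    # CLIP scores the other image's caption at least as high as the image's own caption.
    # Simplified illustration, not the exact NaturalBench collection pipeline.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    def is_confounding_pair(image_path_0, caption_0, image_path_1, caption_1):
        images = [Image.open(image_path_0), Image.open(image_path_1)]
        captions = [caption_0, caption_1]
        inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
        with torch.no_grad():
            sim = model(**inputs).logits_per_image  # shape (2, 2): image-by-caption similarity
        # CLIP is fooled if either image matches the other image's caption
        # at least as strongly as its own caption.
        return bool(sim[0, 1] >= sim[0, 0] or sim[1, 0] >= sim[1, 1])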

Preventing Blind Solutions

Without careful curation, VQA benchmarks can be solved by "blind" QA models that ignore the images. Recent benchmarks often include questions solvable with commonsense or domain knowledge alone. For example, a question from MMMU asks, "What is the common term for the yellow area surrounding the site of an infection?" The correct answer is "Halo", since the other options "Corona", "Border", and "Toxin zone" can be ruled out with medical knowledge. We refer interested readers to our paper and MMStar for further discussion of this issue. Another easily overlooked bias is imbalanced answers. For example, in the popular MME benchmark, the question "Does this artwork exist in the form of a painting?" is answered "Yes" 97.5% of the time! We show that such spurious answer patterns can be exploited by finetuning a "blind" GPT-3.5, as the figure below shows.
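
To see why imbalanced answers alone are exploitable, consider a trivial majority-answer baseline that never looks at the image; the sketch below uses made-up numbers rather than the actual MME statistics.

    # Illustrative only: a "blind" baseline that exploits imbalanced answers.
    # If a yes/no question type is answered "Yes" the vast majority of the time,
    # always predicting the majority answer scores far above 50% chance.
    from collections import Counter

    def majority_answer_accuracy(train_answers, test_answers):
        """Predict the most common training answer for every test question."""
        majority = Counter(train_answers).most_common(1)[0][0]
        return sum(ans == majority for ans in test_answers) / len(test_answers)

    # Hypothetical answer distribution mimicking a heavily skewed benchmark:
    train = ["Yes"] * 97 + ["No"] * 3
    test = ["Yes"] * 98 + ["No"] * 2
    print(majority_answer_accuracy(train, test))  # 0.98, far above 50% random chance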

Image illustrating Bias

We use a random half of each benchmark for training and test on the other half. Finetuning a blind LLM (GPT-3.5) on the QA data alone (without images) significantly outperforms random chance and sometimes even matches the performance of LLaVA-1.5 finetuned with images (more benchmarks are reported in the paper). In contrast, NaturalBench enforces a balanced answer distribution for each question and image, ensuring that blind solutions achieve only random-chance performance. To better understand model performance, we introduce three additional metrics. The "question accuracy" (Q-Acc) metric awards a point only if a model correctly answers a question for both images. Similarly, the "image accuracy" (I-Acc) metric awards a point when a model correctly answers both questions for an image. Lastly, the "group accuracy" (G-Acc) metric awards a point only when a model correctly answers all four (image, question) pairs in a test sample.
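
The three metrics can be computed per test sample (a group of two images and two questions) as in the sketch below; the (image, question)-indexed layout of answers and predictions is an assumed representation, not the official evaluation code.

    # Per-group Q-Acc, I-Acc, and G-Acc for a NaturalBench-style test sample.
    # answers[i][j] and preds[i][j] hold the ground truth and model prediction
    # for (image i, question j); this layout is assumed for illustration.

    def group_metrics(answers, preds):
        correct = [[answers[i][j] == preds[i][j] for j in range(2)] for i in range(2)]
        q_acc = sum(correct[0][j] and correct[1][j] for j in range(2)) / 2  # a question counts only if both images are right
        i_acc = sum(correct[i][0] and correct[i][1] for i in range(2)) / 2  # an image counts only if both questions are right
        g_acc = float(all(correct[i][j] for i in range(2) for j in range(2)))  # all four pairs must be right
        return q_acc, i_acc, g_acc

    answers = (("Yes", "No"), ("No", "Yes"))  # alternating ground-truth answers
    preds = (("Yes", "No"), ("Yes", "Yes"))   # the model errs on (image 1, question 0)
    print(group_metrics(answers, preds))      # (0.5, 0.5, 0.0)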

Model Evaluation

[Table: results of 37 leading VLMs on the NaturalBench benchmark]

We report the performance of 37 leading VLMs on NaturalBench. All models significantly lag behind human performance, with the performance gap (in G-Acc) between humans and models highlighted in red. Interestingly, VLMs with larger language models do not always perform better. Even the best closed-source model, GPT-4o, still falls significantly short of human performance.

What are the Challenges?

Compositionality: Solving a NaturalBench sample often requires a combination of skills, including object recognition, attribute binding, relation understanding, and advanced reasoning such as logic, comparison, differentiation (instance discrimination), counting, and world knowledge. We tag each (image, question) pair with all associated skills for a fine-grained analysis.

Image illustrating Bias

Biases: NaturalBench exposes VLMs' biases towards certain answers (e.g., "Yes") regardless of the input image and question. Using answer likelihoods (VQAScore), we perform a scoring-based evaluation that compares the likelihood of the correct (image, question, answer) triple against the incorrect one, and show that proper debiasing can lead to large performance gains.
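
The sketch below illustrates scoring-based evaluation with a simple debiasing step. The scores stand in for VQAScore-style answer likelihoods, and subtracting each answer's mean score across the benchmark is one straightforward debiasing choice used here for illustration, not necessarily the paper's exact procedure.

    # Scoring-based evaluation with a simple debiasing step (illustrative only).
    # scores[b][a] stands in for a VQAScore-style likelihood that answer a is
    # correct for (image, question) pair b; higher means the model prefers it.
    import numpy as np

    def debiased_predictions(scores, answers=("Yes", "No")):
        scores = np.asarray(scores, dtype=float)
        # A biased model assigns systematically higher scores to one answer
        # (e.g., "Yes") for every pair. Subtracting each answer's mean score
        # across the benchmark removes that constant offset before the argmax.
        debiased = scores - scores.mean(axis=0, keepdims=True)
        return [answers[i] for i in debiased.argmax(axis=1)]

    # Hypothetical scores where a raw argmax would always answer "Yes":
    scores = [[0.90, 0.30],   # ground truth "Yes"
              [0.70, 0.65],   # ground truth "No", but the raw "Yes" score is higher
              [0.95, 0.20],   # ground truth "Yes"
              [0.60, 0.55]]   # ground truth "No"
    print(debiased_predictions(scores))  # ['Yes', 'No', 'Yes', 'No'] after debiasing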

Towards Dynamic Evaluation

Since benchmarks often leak into foundation models' training data, it is crucial to update benchmarks using new data sources. Our benchmark curation method can easily adapt to new image-text datasets. We expand NaturalBench by incorporating two recently proposed datasets: (1) DOCCI with fine-grained captions over 100 words, and (2) XM3600 with captions in Chinese and Hindi. We hope our efforts will inspire future work in studying dynamic evaluations of VLMs.

BibTeX

@inproceedings{li2024natural,
        title={NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples},
        author={Li, Baiqi and Lin, Zhiqiu and Peng, Wenxuan and Nyandwi, Jean de Dieu and Jiang, Daniel and Khanuja, Simran and Ma, Zixian and Krishna, Ranjay and Neubig, Graham and Ramanan, Deva},
        booktitle={TODO},
        year={2024}
      }