Hello! I am Baiqi Li, a research assistant at Carnegie Mellon University, advised by Professor Deva Ramanan. I graduated in June 2024 with a master’s degree from the School of Computer Science at East China Normal University. My research focuses on computer vision and language, particularly evaluation metrics and benchmarks for the cognitive abilities of vision-language models (VLMs) and the improvement of multimodal generative models.

I am looking for a PhD/RA opportunity starting in 2025. If you are interested in my profile, please feel free to contact me. 😃

🔥 News

📝 Publications

NeurIPS24

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Baiqi Li*, Zhiqiu Lin*, Wenxuan Peng*, Jean de Dieu Nyandwi*, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna †, Graham Neubig †, Deva Ramanan †

Website | arXiv | HuggingFace | Evaluation Code

  • In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples.
  • We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with over 10,000 human-verified VQA samples.
  • We evaluate 53 vision-language models on NaturalBench, including open-source and closed-source models such as GPT-4o, Qwen2-VL, Molmo, and InternVL.
  • Mitigating biases, such as GPT-4o's tendency to agree with most questions due to a lack of calibration, can yield a 100% improvement in model performance.
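As an illustrative sketch (not the official evaluation code), a NaturalBench-style paired sample couples two images with two questions, and a model scores the group only if it answers all four (image, question) combinations correctly; the data layout and names below are hypothetical.

```python
# Sketch of group-style scoring for paired VQA samples: a model must get
# all four (image, question) answers right to score the group.
# The sample layout and function names are hypothetical, for illustration only.

def group_score(sample, predict):
    """Return 1 if the model answers all four (image, question) pairs correctly, else 0."""
    correct = all(
        predict(img, q) == gold
        for (img, q), gold in sample["answers"].items()
    )
    return int(correct)

# Toy paired sample: two images x two questions with opposite gold answers.
sample = {
    "answers": {
        ("img1", "q1"): "yes",
        ("img1", "q2"): "no",
        ("img2", "q1"): "no",
        ("img2", "q2"): "yes",
    }
}

# A blind always-"yes" model fails the group whenever any gold answer is "no".
always_yes = lambda img, q: "yes"
print(group_score(sample, always_yes))  # prints 0
```

This kind of group metric penalizes models that exploit answer biases (e.g., always agreeing), since the paired images force opposite answers to the same question.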
ECCV24

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Emily Li, Xide Xia, Graham Neubig, Pengchuan Zhang †, Deva Ramanan †

Website | arXiv | Code

  • We propose VQAScore, the state-of-the-art alignment metric for text-to-image/video/3D models.
  • VQAScore, computed with our new CLIP-FlanT5 model, outperforms prior metrics based on GPT-4Vision or costly human feedback.
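To convey the core idea, VQAScore rates text-image alignment as the probability a VQA model assigns to "Yes" for a yes/no question built from the text. The sketch below is a minimal illustration with a fabricated probability table standing in for a real model such as CLIP-FlanT5; the function names and question template here are assumptions, not the paper's exact implementation.

```python
# Sketch of the VQAScore idea: turn the text into a yes/no question and
# score alignment as the model's probability of answering "Yes".
# `p_yes` is a stand-in for a real VQA model; the toy table is fabricated.

def vqascore(image, text, p_yes):
    """Alignment score = P('Yes' | image, yes/no question built from the text)."""
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    return p_yes(image, question)

# Fabricated probabilities for illustration only.
toy = {("cat.png", 'Does this figure show "a cat"? Please answer yes or no.'): 0.92}
p_yes = lambda image, question: toy.get((image, question), 0.05)

print(vqascore("cat.png", "a cat", p_yes))  # prints 0.92
print(vqascore("cat.png", "a dog", p_yes))  # prints 0.05
```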
In submission

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li*, Zhiqiu Lin*, Deepak Pathak, Emily Li, Feiyi Xin, Kewen Wu, Tiffany Ling, Xide Xia †, Pengchuan Zhang †, Graham Neubig †, Deva Ramanan †

Website | arXiv | HuggingFace

  • We conduct an extensive human study on compositional text-to-visual generation using GenAI-Bench, revealing limitations of leading open-source and closed-source models.
  • We present a simple black-box approach that improves generation by ranking images with VQAScore, surpassing other scoring methods by 2x to 3x.
  • We will release GenAI-Rank with over 40,000 human ratings to benchmark methods that rank images generated from the same prompt.
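The black-box ranking approach above can be sketched as follows: generate several candidate images for one prompt, then keep the candidate the alignment metric rates highest. The score function and toy scores below are hypothetical stand-ins for a metric like VQAScore.

```python
# Sketch of black-box, rank-by-score image selection: generate N candidates
# for a prompt, then keep the one a text-image alignment metric rates highest.
# `score_fn` is a stand-in for a metric like VQAScore; toy scores are made up.

def pick_best(candidates, prompt, score_fn):
    """Return the candidate image that score_fn rates highest for the prompt."""
    return max(candidates, key=lambda img: score_fn(img, prompt))

# Fabricated alignment scores for three candidate images.
toy_scores = {"img_a": 0.31, "img_b": 0.87, "img_c": 0.54}
score_fn = lambda img, prompt: toy_scores[img]

best = pick_best(["img_a", "img_b", "img_c"], "a red cube on a blue sphere", score_fn)
print(best)  # prints img_b
```

The approach is black-box because it never touches the generator's weights; it only reranks finished outputs, so it works with any text-to-image model.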

🎖 Honors and Services

  • Reviewer: ICLR, ECAI, TIST …

📖 Research Experience

  • 2023.08 - present, Research Assistant, Carnegie Mellon University.
  • 2021.09 - 2024.06, Master's student, East China Normal University.

💬 Invited Talks

  • 2024.05, I presented the GenAI-Bench and NaturalBench benchmarks at CMU.