Hello! I’m Baiqi Li, a first-year PhD student in Computer Science at the University of North Carolina at Chapel Hill, advised by Gedas Bertasius. My research focuses on video understanding and its applications in robotics. Before joining UNC, I had the privilege of conducting research with Deva Ramanan. I graduated from East China Normal University in Shanghai, China.

🔥 News

šŸ“ Publications

NeurIPS24

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Baiqi Li*, Zhiqiu Lin*, Wenxuan Peng*, Jean de Dieu Nyandwi*, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna †, Graham Neubig †, Deva Ramanan †

Website | arXiv | HuggingFace | Evaluation Code

  • In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples.
  • We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with over 10,000 human-verified VQA samples.
  • We evaluate 53 vision-language models on NaturalBench, including open-source and closed-source models such as GPT-4o, Qwen2-VL, Molmo, and InternVL.
  • Mitigating biases, such as GPT-4o’s tendency to agree with most questions due to a lack of calibration, can yield a 100% improvement in model performance.
ECCV24

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Emily Li, Xide Xia, Graham Neubig, Pengchuan Zhang †, Deva Ramanan †

Website | arXiv | Code

  • We propose VQAScore, the state-of-the-art alignment metric for text-to-image/video/3D models.
  • VQAScore, built on our new CLIP-FlanT5 model, outperforms previous metrics based on GPT-4Vision or costly human feedback.
In submission

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li*, Zhiqiu Lin*, Deepak Pathak, Emily Li, Feiyi Xin, Kewen Wu, Tiffany Ling, Xide Xia †, Pengchuan Zhang †, Graham Neubig †, Deva Ramanan †

Website | arXiv | HuggingFace

  • We conduct an extensive human study on compositional text-to-visual generation using GenAI-Bench, revealing limitations of leading open-source and closed-source models.
  • We present a simple black-box approach that improves generation by ranking images with VQAScore, significantly surpassing other scoring methods by 2x to 3x.
  • We will release GenAI-Rank with over 40,000 human ratings to benchmark methods that rank images generated from the same prompt.

🎖 Honors and Services

  • Reviewer: NeurIPS, ICLR, ECAI, TIST …

📖 Research Experience

  • 2025.08 - present, PhD, University of North Carolina at Chapel Hill.
  • 2023.08 - 2025.05, Research Assistant, Carnegie Mellon University.