TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
Abstract
Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind adopts a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, and complementary questions neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2,400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best-performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding.
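To make the pair-level scoring concrete, the snippet below is a minimal sketch of how Instance Accuracy can be computed: an instance is credited only when every video-question pair belonging to it is answered correctly. The prediction record format and field names are illustrative assumptions, not the benchmark's actual data schema.

```python
from collections import defaultdict

def instance_accuracy(predictions):
    """Compute Instance Accuracy (I-Acc).

    `predictions` is an iterable of dicts with keys 'instance_id', 'pred',
    and 'answer' (hypothetical field names). An instance is credited only
    when every video-question pair belonging to it is answered correctly.
    """
    per_instance = defaultdict(list)
    for p in predictions:
        per_instance[p["instance_id"]].append(p["pred"] == p["answer"])
    solved = sum(all(correct) for correct in per_instance.values())
    return solved / len(per_instance) if per_instance else 0.0

# Toy example: one instance fully solved, one only half solved -> I-Acc = 0.5
preds = [
    {"instance_id": 0, "pred": "shaking the cup", "answer": "shaking the cup"},
    {"instance_id": 0, "pred": "holding it still", "answer": "holding it still"},
    {"instance_id": 1, "pred": "holding it still", "answer": "shaking the cup"},
    {"instance_id": 1, "pred": "holding it still", "answer": "holding it still"},
]
print(instance_accuracy(preds))  # 0.5
```

This strict pair-level criterion is what separates genuine temporal discrimination from per-video guessing: a model that ignores motion and answers both videos identically can still score well on per-question accuracy, but not on I-Acc.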
Inspiration
An example video pair that shares identical static visual content but differs solely in motion dynamics. The top video shows a person shaking a cup while making coffee; the bottom video shows the same person holding it still. Even the most advanced models, such as GPT-5 and Gemini 3 Pro, fail to distinguish the two actions.
Taxonomy and Statistics
TimeBlind Taxonomy and Statistics. Left: We structure the evaluation into 11 fine-grained spatio-temporal compositional categories spanning three high-level aspects: Atomic Events (what changes), Parametric Event Attributes (how it changes), and Structural Event Logic (how events compose). Top Right: Distribution of video lengths across the benchmark, showing that most videos fall within the 0–15 second range. Bottom Right: Distribution of question word counts, indicating that most questions are under 30 words. Overall: Our benchmark features a structured taxonomy with diverse categories while maintaining short videos and concise questions.
Data Generation Pipeline
Overview of the TimeBlind data construction pipeline. Stage 1 (Schema Generation): We prompt GPT-5 to generate paired complementary questions targeting temporal differences. Stage 2 (Video Acquisition): For each generated schema, we collect a matching video pair from one of the following sources: (i) retrieving videos from the internet, (ii) recording videos with human actors, or (iii) generating videos via simulation (e.g., Unity). We then pair these videos with the questions to form a candidate TimeBlind instance. Stage 3 (Manual Review): Human annotators manually review each instance to ensure: (i) Static Consistency (videos share identical static content), (ii) Temporal Minimality (the pair differs only in the targeted temporal factor), and (iii) Question Validity (QA pairs are clear and correct).
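For illustration, a candidate instance produced by this pipeline could be represented as in the sketch below. The class and field names are hypothetical and shown only for exposition; the pipeline itself does not prescribe a concrete data format.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Literal

# Stage 2 sources named above: internet retrieval, human recording, or simulation.
VideoSource = Literal["internet", "human_recording", "simulation"]

@dataclass
class ComplementaryQuestion:
    question: str          # question text generated in Stage 1 (via GPT-5)
    answer_video_a: str    # ground-truth answer for the first video of the pair
    answer_video_b: str    # ground-truth answer for the second video of the pair

@dataclass
class CandidateInstance:
    video_a: str           # path or URL of the first video (Stage 2)
    video_b: str           # path or URL of the second video (Stage 2)
    source: VideoSource
    questions: list[ComplementaryQuestion] = field(default_factory=list)
    # Stage 3 manual-review checklist
    static_consistency: bool = False   # videos share identical static content
    temporal_minimality: bool = False  # pair differs only in the targeted temporal factor
    question_validity: bool = False    # QA pairs are clear and correct

    def passes_review(self) -> bool:
        """An instance enters the benchmark only if all three checks pass."""
        return self.static_consistency and self.temporal_minimality and self.question_validity
```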
Overall Results
Main Results on TimeBlind. We use uniform sampling at 1 FPS and evaluate all models using default configurations, with Instance Accuracy (I-Acc) as the primary metric. The table is divided into Open-Source models (grouped by size: <10B and >10B) and Closed-Source models. Our results suggest that all models perform poorly on fine-grained temporal video understanding, with no model achieving over 50% I-Acc. Best results are shown in bold.
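For reference, the 1 FPS uniform sampling used in evaluation can be implemented roughly as below. This is a minimal sketch assuming OpenCV for frame decoding; the exact preprocessing code is not shown on this page, so treat the library choice and function name as illustrative.

```python
import cv2  # OpenCV, used here only as one possible decoding backend

def sample_frames_1fps(video_path: str):
    """Return roughly one frame per second of video, sampled uniformly."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    step = max(int(round(fps)), 1)            # keep every `step`-th frame, i.e. ~1 FPS
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```

Because most TimeBlind videos are under 15 seconds, 1 FPS sampling yields only a handful of frames per video, which keeps the evaluation comparable across models with different context budgets.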
BibTeX