WatchAct: A Benchmark for Behavior-Grounded Robot Manipulation

Overview

Motivating Example

A robot working alongside people must reason about what they have done, in what order, and with what intent. For example, the robot watches a person rearrange objects and is then given an instruction: put the moved objects back where they were. It must infer a plan from the human video and execute that plan in a digital twin of the scene. Success therefore requires both reasoning from the human video and planning the robot's next actions, a setting overlooked by prior work.

Instruction: “Put the moved objects back where they were.”

What the robot sees of the human

The robot executes the task
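
To make the "put the moved objects back" instruction concrete, here is a minimal sketch of the planning step only. It assumes object positions before and after the human's actions have already been extracted from the video, which is the hard part the benchmark targets. The function name `restore_plan`, the 2D table coordinates, and the object names are hypothetical illustrations, not part of WatchAct.

```python
def restore_plan(before, after, tol=1e-6):
    """Return (object, original_position) steps for every object that moved.

    `before` and `after` map object names to hypothetical (x, y) table
    coordinates observed at the start and end of the human video.
    """
    plan = []
    for name, start in before.items():
        end = after[name]
        # An object counts as moved if any coordinate changed beyond the tolerance.
        moved = any(abs(a - b) > tol for a, b in zip(start, end))
        if moved:
            plan.append((name, start))
    return plan


# Hypothetical scene: the person moved the cup and the bottle, not the box.
before = {"cup": (0.10, 0.40), "box": (0.55, 0.20), "bottle": (0.80, 0.60)}
after = {"cup": (0.35, 0.45), "box": (0.55, 0.20), "bottle": (0.80, 0.10)}

plan = restore_plan(before, after)
for name, target in plan:
    print(f"pick {name} -> place at {target}")
```

Objects that were never touched produce no step, so returning an unmoved object would already be a planning error under this sketch.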

WatchAct Instances

Language Instruction: “Put the object that was shaken vigorously into the basket farther from the camera.”

Real Human-action Video

Language Instruction: “Put the displaced boxes back in their original positions.”

Real Human-action Video

Executable Task

Language Instruction: “Move objects by imitating human action sequences.”

Real Human-action Video

Executable Task

Language Instruction: “Put the object that was picked up three times into the tray.”

Real Human-action Video

Executable Task

Language Instruction: “Put the displaced boxes back in their original positions.”

Real Human-action Video

Executable Task

Spatial Reasoning Design

Videos are captured from diverse camera viewpoints.

Viewpoints shown: Frontal View, Side View, Oblique View.

Language instructions use varied reference frames, either human-centered or camera-centered:
- Human-centered: e.g., "the cup on the human's left"
- Camera-centered: e.g., "the cup on the camera's left"
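
The two reference frames can pick out different objects for the same word "left". The sketch below illustrates this for a frontal view only, under the loudly stated assumption that the person faces the camera, so the person's left is the camera's right. The `SceneObject` class, the `x_cam` coordinate, and the `leftmost` helper are hypothetical and not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    x_cam: float  # horizontal image coordinate; smaller means further left in the camera view


def leftmost(objects, frame, human_faces_camera=True):
    """Return the object on the 'left' in the given reference frame."""
    if frame == "camera":
        return min(objects, key=lambda o: o.x_cam)
    if frame == "human":
        # Assumption: a person facing the camera sees left and right mirrored
        # relative to the image, so their left is the largest x_cam.
        if human_faces_camera:
            return max(objects, key=lambda o: o.x_cam)
        return min(objects, key=lambda o: o.x_cam)
    raise ValueError(f"unknown reference frame: {frame}")


cups = [SceneObject("red cup", 0.2), SceneObject("blue cup", 0.8)]
camera_left = leftmost(cups, "camera")
human_left = leftmost(cups, "human")
print(camera_left.name, "|", human_left.name)
```

Side and oblique views would need the actual camera and human poses rather than a simple mirror, which this sketch does not attempt.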

Experimental Takeaways

Takeaway 1: VLMs struggle with video-to-plan reasoning.

The best model, Gemini-3.1-Pro, reaches only 36.8%, a gap of 60.3 points, which indicates that inferring a valid manipulation plan from video remains a major challenge for current VLMs.

Takeaway 2: Robotic policies struggle to follow oracle plans.

Even with oracle plans, current policies struggle on WatchAct: paired with π_0.5, oracle plans reach only 21.5% Task SR.

Takeaway 3: Errors accumulate across the integrated pipeline.

Predicted plans degrade execution relative to oracle plans. Paired with π_0.5, Gemini-3.1-Pro plans reach 16.3% Task SR and GPT-5.4 plans 13.5%, both below the 21.5% oracle-plan ceiling.
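
The size of the degradation follows directly from the numbers above. The short calculation below uses only the reported Task SR values; the variable names are ours, and the relative drop is a derived quantity rather than a figure reported by the benchmark.

```python
# Reported Task SR (%) with the π_0.5 policy.
oracle_sr = 21.5
predicted_sr = {"Gemini-3.1-Pro": 16.3, "GPT-5.4": 13.5}

# Absolute drop in percentage points relative to oracle plans.
drops = {model: round(oracle_sr - sr, 1) for model, sr in predicted_sr.items()}

# Relative drop as a percentage of the oracle-plan Task SR (derived, not reported).
relative = {
    model: round(100 * (oracle_sr - sr) / oracle_sr, 1)
    for model, sr in predicted_sr.items()
}

for model in predicted_sr:
    print(f"{model}: -{drops[model]} points ({relative[model]}% relative)")
```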

Rollout Demos

We show several rollout demos from LIBERO simulation and real-world experiments.

Instruction: “Put the object that was picked up three times into the basket.”

Human Video

Robot Execution Failure

In the human video, the person picks up the white bottle three times. Based on the number of pick-up actions, the robot must localize the white bottle and move it to the basket; however, the robot instead moves it to the wooden tray.

Instruction: “Put the object that was shaken vigorously into the basket farther from the camera.”

Human Video

Robot Execution Failure

In the human video, the person vigorously shakes the blue box and gently picks up the red box. The robot only needs to move the blue box to the basket farther from the camera; however, after picking up the blue box, the robot mistakenly moves it to the other basket.

Instruction: “Put the displaced boxes back in their original positions.”

Human Video

Robot Execution Failure

In the human video, the person removes several objects from the basket and places them on the table. The robot is asked to return the displaced objects to their original positions; however, it mistakenly places into the basket an object that had not been moved.

Instruction: “Put the displaced boxes back in their original positions.”

Human Video

Robot Execution Success

In the human video, the person takes a blue cream cheese box out of the basket near the camera and places it in the center of the table. The robot successfully picks up the blue box and places it into the correct basket.

Instruction: “Put the moved objects where they were.”

Human Video

Robot Execution Success

In the human video, the person removes a cup from the basket. Following the instruction, the robot successfully returns the displaced cup to its original position.

Instruction: “Put the object that was shaken vigorously into the basket closer to the robot.”

Human Video

Robot Execution Failure

In the human video, the person vigorously shakes the cola can. The robot must recognize the shaking action, localize the cola can, and follow the instruction to pick it up and place it into the basket. However, execution fails: after picking up the cola can, the robot does not place it into the basket.

Instruction: “Imitate the human's actions to manipulate the objects.”

Human Video

Robot Execution Failure

In the human video, the person moves a cup and pours cola into it. The robot is asked to imitate this action sequence. After picking up the cola, however, the robot fails at the pouring step.
