EBench: A New Benchmark Diagnoses Mobile Manipulation Robots' Core Capabilities (2026)

Researchers have built EBench, a benchmark of 26 mobile manipulation tasks across nine scene categories that systematically diagnoses the strengths and weaknesses of generalist robot policies. Instead of a single score, EBench breaks down performance across five capability dimensions — revealing why aggregate success rates can hide critical gaps in dexterity, long-horizon planning, or environment adaptability.

What the Researchers Built

EBench is an evaluation framework designed to diagnose generalist mobile manipulation policies: the control policies that drive humanoid or warehouse robots through unstructured environments. It contains 26 carefully designed tasks sampled from nine scene categories such as kitchens, industrial labs, and storage areas.

What makes EBench unique is its five-dimensional capability breakdown: operating mode (fixed-base vs. mobile), temporal horizon (short vs. long-horizon tasks), precision (coarse vs. dexterous), atomic skill (picking, placing, inserting, tightening, gear meshing, and more), and scene category. To generate training data, the team coupled two complementary streams: kinematically isomorphic teleoperation for contact-rich dexterous tasks (e.g., peg-in-hole, nut tightening), and motion planning for long-horizon sequences that are nearly impossible to teleoperate reliably due to cumulative failure probability.

[Figure: Data synthesis diagram showing the teleoperation and motion planning streams]

The result is a reproducible "screening substrate" that lets researchers see exactly where a policy excels and where it falls short — far more informative than a single average success rate.

Key Results

When the researchers evaluated four state-of-the-art generalist mobile manipulation policies with EBench, they found that aggregate success rates were deceptively similar. The real value emerged from the five-dimensional breakdown.

Key findings include:

- No single policy dominated across all tasks. The best performer on dexterous insertion tasks often struggled with long-horizon navigation-and-grasp sequences.
- Operating mode had a strong effect: policies that performed well on fixed-base tasks sometimes degraded significantly when the base had to move simultaneously.
- Temporal horizon exposed a stark trade-off: policies that could succeed on short-horizon pick-and-place tasks often failed on tasks requiring 8–12 steps.
- Precision was the hardest dimension: contact-rich tasks like gear meshing and nut tightening were failed by most policies, regardless of their performance on coarse tasks.
- Scene category introduced further variance: a policy that handled kitchen scenes well might drop 40% in success rate when tested in an industrial lab layout.

These results confirm that evaluating a mobile manipulation policy by a single number — or even a handful of scenes — is misleading. EBench provides the diagnostic lens needed to guide both research priorities and practical robot selection.

How It Works

EBench operates entirely in simulation, using a high-fidelity physics engine. The benchmark covers 26 tasks grouped into nine scene categories, each designed to isolate specific capability factors.

Five Evaluation Dimensions

Dimension	Description	Example Task Pairs
Operating Mode	Fixed base vs. mobile base	Peg insertion on table vs. peg insertion while driving
Temporal Horizon	Short (1–3 steps) vs. long (8–12 steps)	Pick-place vs. navigate-then-pick-then-insert-then-stow
Precision	Coarse (>5 cm tolerance) vs. dexterous (<1 mm)	Block stacking vs. key insertion
Atomic Skill	Specific manipulation action	Pick vs. place vs. screw vs. gear mesh
Scene Category	Kitchen, lab, storage, etc.	Same task in different environments

[Figure: Chart showing the capability breakdown across the five dimensions]

Data Synthesis Pipeline

The team used two parallel collection streams. For 7 dexterous tasks (e.g., peg insertion, nut tightening, gear meshing), they set up a kinematically isomorphic teleoperation system: a human operator moves a leader arm, and a follower arm with matching kinematics mirrors it exactly, preserving the micro-corrections needed for contact-rich manipulation. For long-horizon tasks (e.g., "pick part A from bin, move to assembly station, insert B, then return to start"), they relied on motion planning, because teleoperating a 20-step sequence without a single failure is nearly impossible.
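
The cumulative-failure argument can be made concrete with a quick calculation. The sketch below assumes every step succeeds independently with the same probability. The 95% per-step rate is an illustrative assumption, not a figure reported by the EBench authors, and the function name is hypothetical.

```python
def sequence_success_probability(per_step_success: float, num_steps: int) -> float:
    """Chance that every step of a sequence succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


if __name__ == "__main__":
    # 0.95 is an assumed per-step success rate for a human teleoperator.
    for steps in (3, 12, 20):
        p = sequence_success_probability(0.95, steps)
        print(f"{steps:2d} steps: {p:.1%} chance of an error-free demonstration")
```

Under these assumptions, only about a third of 20-step attempts would finish without an error, which is consistent with the article's point that long sequences are better generated by a motion planner.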

Each task includes multiple "perspectives" (camera viewpoints) and multiple initialization conditions to increase diversity. The benchmark then computes success rates per dimension, enabling the diagnostic radar plots that make EBench valuable.
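
A per-dimension breakdown of this kind might be computed along the following lines. This is a minimal sketch, not the EBench implementation: the `Episode` schema, the field names, and the toy rollouts are all hypothetical, and averaging per episode (rather than per task) is an assumption the article does not confirm.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five dimensions named in the article; field names are hypothetical.
DIMENSIONS = ("operating_mode", "horizon", "precision", "atomic_skill", "scene")


@dataclass(frozen=True)
class Episode:
    """One evaluation rollout, tagged along the five dimensions."""
    task: str
    operating_mode: str  # "fixed" or "mobile"
    horizon: str         # "short" or "long"
    precision: str       # "coarse" or "dexterous"
    atomic_skill: str
    scene: str
    success: bool


def success_by_dimension(episodes):
    """Return {dimension: {value: success_rate}} averaged over episodes."""
    counts = {dim: defaultdict(lambda: [0, 0]) for dim in DIMENSIONS}
    for ep in episodes:
        for dim in DIMENSIONS:
            tally = counts[dim][getattr(ep, dim)]
            tally[0] += int(ep.success)  # successes
            tally[1] += 1                # total rollouts
    return {
        dim: {value: wins / total for value, (wins, total) in values.items()}
        for dim, values in counts.items()
    }


# Toy rollouts, invented purely to exercise the function.
episodes = [
    Episode("peg_insertion", "fixed", "short", "dexterous", "insert", "lab", True),
    Episode("peg_insertion", "mobile", "short", "dexterous", "insert", "lab", False),
    Episode("pick_place", "fixed", "short", "coarse", "pick", "kitchen", True),
    Episode("pick_place", "mobile", "long", "coarse", "pick", "kitchen", True),
]
rates = success_by_dimension(episodes)
print(rates["operating_mode"])
print(rates["precision"])
```

Because every rollout carries all five tags, the same set of results can be sliced along any axis without re-running the policy, which is what makes a radar-style profile cheap to produce.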

Why This Matters for Robotics

For anyone evaluating robots, whether a warehouse manager comparing cobots or a researcher developing next-generation humanoid controllers, EBench offers three practical benefits.

First, it prevents misleading conclusions. A policy that scores 80% in a kitchen might owe that score to strong open-loop grasping while remaining weak at fine manipulation. EBench separates those factors.

Second, it accelerates debugging. If your robot fails a real-world task, EBench helps you pinpoint whether the failure is in perception, dexterous control, or long-horizon planning — before you spend hours on physical trials.

Third, it enables better procurement decisions. A warehouse robot that handles long routes but drops precision tasks is a different product than one that excels at assembly. EBench scores can help buyers match robot capabilities to job requirements.

The benchmark is also reproducible and open, meaning the entire community can compare policies on the same playing field — something missing in most current evaluations.

Limitations and Open Questions

EBench currently operates entirely in simulation, and the authors explicitly caution that simulation scores do not guarantee real-world performance. The benchmark is intended as a "screening substrate" that precedes physical evaluation, not a replacement for it. Correlation between simulated and real performance remains an open question that the team plans to study.

The 26-task suite sparsely covers the nine scene categories, so scene-level rankings should be considered preliminary. Expanding to hundreds of tasks is on the roadmap, which would unlock regression-based analysis and reduce statistical noise.
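
To see why sparse coverage makes scene-level rankings noisy, consider the uncertainty around a measured success rate. The sketch below uses the Wilson score interval, a standard confidence interval for a binomial proportion. It is not part of EBench, the rollout counts are hypothetical, and it assumes rollouts are independent trials, which rollouts drawn from the same few tasks are not, so real uncertainty is likely larger.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Same observed 60% success rate, with hypothetical rollout counts.
    for successes, trials in ((18, 30), (180, 300)):
        low, high = wilson_interval(successes, trials)
        print(f"{successes}/{trials}: [{low:.3f}, {high:.3f}]")
```

At the same observed rate, the smaller sample yields an interval several times wider than the larger one, so two policies whose scene-level scores differ by a few points may not be distinguishable until the suite grows.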

Finally, the benchmark tests only mobile manipulation — it does not assess human-robot interaction, learning from human feedback, or safety. These are important dimensions for real-world deployment that EBench currently leaves out.

Frequently Asked Questions

Q: What makes EBench different from other robot benchmarks?

Most benchmarks report a single success rate or task average. EBench breaks performance down into five independent dimensions to reveal a robot's true capability profile.

Q: How are the 26 tasks collected?

Dexterous tasks (e.g., peg insertion, nut tightening) use human teleoperation with a mirrored setup. Long-horizon tasks (e.g., multi-step assembly) use motion planning, since teleoperating long sequences is too failure-prone.

Q: What are the five evaluation dimensions?

Operating mode (fixed vs. mobile), temporal horizon (short vs. long), precision (coarse vs. dexterous), atomic skill (specific manipulation action), and scene category (type of environment).

Q: Can EBench predict how a policy will perform on a real robot?

Not yet. The benchmark is currently simulation-only, and the authors plan to study sim-to-real correlation in future work.

Conclusion

EBench fills a critical gap in mobile manipulation evaluation by moving beyond aggregate success rates to a multi-dimensional diagnostic framework. Its 26-task suite, two-stream data collection, and five-axis analysis give researchers and buyers a clearer picture of where a policy truly excels — and where it needs work.