InSight: How Robots Learn New Skills Without Human Help (2026)

Teaching robots new manipulation skills is expensive. Collecting human demonstrations and fine-tuning a policy requires substantial human effort for every new task. Vision-language-action (VLA) models have made progress toward general-purpose manipulation, but their capabilities remain bounded by the skills present in their training data. Humans approach a novel scenario differently: we know which skills we can already perform, so we recognize when those skills are insufficient. We then reason about what new capability would bridge the gap and learn it through targeted practice. The acquired skill is stored as a reusable capability for future tasks, enabling continual, lifelong learning.

We propose InSight, a framework for open-world skill acquisition via steerable VLAs. We show how a VLA can be made steerable at the level of composable manipulation primitives and then autonomously extended when a novel task requires a missing primitive.

Primitive Segmentation from Demonstrations

An automatic primitive segmentation pipeline decomposes teleoperated demonstrations into labeled primitives without manual annotation, enabling primitive-level VLA steerability. Demonstrations are segmented offline in three stages using a vision-language model (VLM). First, the VLM decomposes the task instruction into an ordered primitive sequence. Second, the VLM processes the subsampled video frame by frame and assigns each frame to a primitive in the plan, cross-checking the image against a per-frame end-effector motion caption that reports the dominant translation/rotation axis; it then returns the boundary frames between consecutive primitives. Third, each boundary is refined by a localized pass that reconciles the change-point in the end-effector deltas with the earliest visually unambiguous frame. The result is a set of contiguous, primitive-labeled segments, each of which becomes one training episode.
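The bookkeeping around the VLM calls in this pipeline can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the caption wording, the primitive names, and the `rot_weight` parameter that trades off radians against meters are all assumptions made for the example. It shows one way to produce a dominant-axis motion caption from an end-effector delta and to turn boundary frames into contiguous labeled segments.

```python
from dataclasses import dataclass
from typing import List, Sequence

AXES = ("x", "y", "z")


def dominant_axis_caption(d_pos: Sequence[float], d_rot: Sequence[float],
                          rot_weight: float = 1.0) -> str:
    """Caption the dominant translation or rotation axis of one end-effector delta.

    rot_weight is a hypothetical scale used to compare rotation with translation.
    """
    candidates = [(abs(v), f"translate {'+' if v >= 0 else '-'}{a}")
                  for a, v in zip(AXES, d_pos)]
    candidates += [(abs(v) * rot_weight, f"rotate {'+' if v >= 0 else '-'}{a}")
                   for a, v in zip(AXES, d_rot)]
    # Pick the component with the largest (weighted) magnitude.
    return max(candidates, key=lambda c: c[0])[1]


@dataclass(frozen=True)
class Segment:
    primitive: str
    start: int  # first frame of the segment (inclusive)
    end: int    # one past the last frame (exclusive)


def segments_from_boundaries(plan: List[str], boundaries: List[int],
                             num_frames: int) -> List[Segment]:
    """Turn an ordered plan and its boundary frames into contiguous segments."""
    if len(boundaries) != len(plan) - 1:
        raise ValueError("need exactly one boundary between consecutive primitives")
    edges = [0, *boundaries, num_frames]
    if any(b <= a for a, b in zip(edges, edges[1:])):
        raise ValueError("boundaries must be strictly increasing and inside the video")
    return [Segment(p, s, e) for p, s, e in zip(plan, edges, edges[1:])]


# Hypothetical example: a 30-frame clip with a three-primitive plan.
caption = dominant_axis_caption((0.0, 0.0, -0.03), (0.0, 0.01, 0.0))
segments = segments_from_boundaries(["reach", "grasp", "lift"], [12, 20], 30)
```

Because every frame falls in exactly one half-open interval, the segments are contiguous by construction, which matches the requirement that each segment become one training episode.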

Visualization of demonstration segmentation boundaries and primitive labeling

VLA with Steerable Primitives

We define a skill as a target capability described by a language instruction (e.g., "unscrew the bottle cap and pour the contents into the bowl"). A plan is the sequence of primitives that the VLM planner generates to complete a skill.

Comparison of primitive gap identification and acquisition process

VLM-Guided Skill Acquisition

Given a steerable VLA trained on a base set of primitives, InSight autonomously expands the skill set when presented with a novel task that requires missing primitives. First, the VLM decomposes the task into a primitive sequence and compares it against the known primitive vocabulary. Primitives not in the vocabulary are flagged as primitive gaps. The planner is constrained to return one single-axis motion per primitive gap, so a task requiring multiple distinct motions (e.g., tilt forward and then tilt back) produces multiple primitive gaps rather than a single composite primitive.
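The gap check described above amounts to a membership test of the plan against the vocabulary. The sketch below assumes primitives are compared by exact name, which is a simplification: in InSight the comparison is performed by the VLM, and the primitive names and the `find_primitive_gaps` helper here are hypothetical.

```python
from typing import Iterable, List


def find_primitive_gaps(plan: Iterable[str], vocabulary: Iterable[str]) -> List[str]:
    """Return plan primitives missing from the vocabulary, in plan order, without repeats."""
    known = set(vocabulary)
    gaps: List[str] = []
    for primitive in plan:
        if primitive not in known and primitive not in gaps:
            gaps.append(primitive)
    return gaps


# Hypothetical base vocabulary and plan; names are illustrative only.
base_vocabulary = {"reach", "grasp", "lift", "move", "place", "release"}
tilt_plan = ["reach", "grasp", "lift", "tilt_forward", "tilt_back", "place", "release"]
gaps = find_primitive_gaps(tilt_plan, base_vocabulary)
```

With this plan, the two tilt motions are reported as two separate gaps, mirroring the one-single-axis-motion-per-gap constraint.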

A VLM-guided primitive acquisition loop identifies missing primitives for novel tasks, executes them with VLM-derived parameters, and retrains the VLA on autonomously generated demonstrations to accomplish new skills.
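As a rough sketch of how such a loop could be organized, the code below wires the three steps together with injected callables standing in for the VLM planner, the VLM parameter proposal, the low-level controller, and VLA retraining. Everything here is an assumption for illustration: the function signatures, the `rollouts_per_gap` parameter, and in particular the choice to keep only successful rollouts as training data, which the description above does not specify.

```python
from typing import Callable, Dict, List, Set

Episode = Dict[str, object]


def acquire_missing_primitives(
    task: str,
    vocabulary: Set[str],
    plan_fn: Callable[[str], List[str]],                # stand-in for the VLM planner
    propose_params_fn: Callable[[str, str], dict],      # stand-in for VLM-derived parameters
    rollout_fn: Callable[[str, dict], Episode],         # stand-in for low-level execution
    retrain_fn: Callable[[Dict[str, List[Episode]]], None],  # stand-in for VLA retraining
    rollouts_per_gap: int = 20,
) -> Set[str]:
    """Identify gaps, execute them to generate data, retrain, and return the new vocabulary."""
    plan = plan_fn(task)
    gaps = [p for p in dict.fromkeys(plan) if p not in vocabulary]

    new_episodes: Dict[str, List[Episode]] = {}
    for gap in gaps:
        params = propose_params_fn(task, gap)
        attempts = [rollout_fn(gap, params) for _ in range(rollouts_per_gap)]
        # Assumption: only successful rollouts are kept as demonstrations.
        kept = [ep for ep in attempts if ep.get("success")]
        if kept:
            new_episodes[gap] = kept

    if new_episodes:
        retrain_fn(new_episodes)
    return vocabulary | set(new_episodes)


# Usage with trivial stubs (all hypothetical).
retrain_calls: List[Dict[str, List[Episode]]] = []
updated_vocabulary = acquire_missing_primitives(
    task="flip the block",
    vocabulary={"reach", "grasp", "place"},
    plan_fn=lambda task: ["reach", "grasp", "rotate_block", "place"],
    propose_params_fn=lambda task, gap: {"axis": "roll", "angle_deg": 90.0},
    rollout_fn=lambda gap, params: {"primitive": gap, "params": params, "success": True},
    retrain_fn=retrain_calls.append,
    rollouts_per_gap=3,
)
```

Passing the components in as callables keeps the sketch independent of any particular VLM, controller, or training stack.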

Simulation Results: Block Flipping from Pick-and-Place Demos

We evaluate InSight across simulation and real-world manipulation tasks. In simulation, we use a 7DoF Franka Panda in the LIBERO environment to study block flipping from pick-and-place demonstrations. The robot is asked to flip a Lego block so that the peg faces upward, given only human demonstrations of block pick-and-place. We collect 150 human-teleoperated pick-and-place demonstrations in which the block lies on its side, and automatically segment them into over 700 primitive episodes across seven primitive types. The block-flip task requires a rotate-block primitive that is not present in the pick-and-place demonstrations, and the VLM identifies it as a primitive gap.

Hardware Validation Across Multiple Tasks

On hardware, we use a 6DoF UFactory xArm. We evaluate bottle twisting and pouring against a Code-as-Policies-style zero-shot baseline, then compose the individually acquired twist and pour primitives with the base pick-and-place skills into a long-horizon twist-then-pour task. We also measure whether the unified policy retains its original pick-and-place skills after new primitives are added. Finally, we evaluate whether InSight extends to contact-rich, non-prehensile motions by acquiring a sweeping primitive from scooping demonstrations.

Hardware setup for pouring and twisting tasks on the UFactory xArm

Key Results

We validate InSight across five tasks in simulation and on hardware: block flipping, drawer closing, sweeping, twisting, and pouring. The framework enables autonomous skill acquisition with zero target-skill human demonstrations, achieving up to 96% success on tasks such as pouring and 80% success on a complex 14-primitive long-horizon task, while retaining full performance on the original base skills.

Conclusion, Limitations, and Future Work

We present InSight, a method for autonomous skill acquisition in VLAs through VLM-guided primitive gap discovery and execution. By training on autonomously segmented primitives, identifying primitive gaps via VLM reasoning, and generating training data through VLM-guided low-level control, InSight enables robots to acquire new skills without additional human demonstrations.

Frequently Asked Questions

How does InSight identify which primitives are missing for a novel task? The VLM decomposes the task into a primitive sequence and compares each primitive against the known vocabulary. Any primitive not already in the vocabulary is flagged as a primitive gap requiring acquisition.

Does InSight require any human demonstrations for the new skill being acquired? No. InSight acquires the new skill with zero target-skill human demonstrations: all training data for the missing primitives is generated autonomously through VLM-guided low-level control, and the VLA is then retrained on it.

Can InSight add new primitives without forgetting previously learned skills? Yes. Experiments show the unified policy retains full performance on original base skills after new primitives are added and trained.

How long a primitive sequence can InSight handle in a complex long-horizon task? InSight achieved 80% success on a complex 14-primitive long-horizon task, which suggests the approach scales to extended manipulation sequences.