MemoryWAM: Persistent Memory Makes Robot Action Models Faster and Smarter (2026)

Most robot action models forget what happened more than a few seconds ago, causing them to fail at tasks that require remembering past events. MemoryWAM introduces a hybrid persistent memory system that lets robotic world action models recall long-term context without the crippling computational cost of storing every past frame.

What the Researchers Built

MemoryWAM is a dual-model architecture for long-horizon robotic manipulation that combines a video diffusion model (Video DiT) with a separate action diffusion model (Action DiT). The breakthrough is a three-tier memory system: a sliding window of recent observations, a set of "anchor frames" saved periodically to capture important transitions, and compact "gist tokens" that compress the entire history into a small memory footprint.

Unlike earlier world action models (WAMs) that either lack memory entirely or retain full history (which becomes prohibitively expensive over time), MemoryWAM maintains a fixed-size memory budget. During inference, the Video DiT processes only the current observation and updates the key-value (KV) cache with compressed historical context. The Action DiT then denoises action tokens while attending to this cached representation, enabling long-horizon reasoning without re-processing past frames.

The researchers tested MemoryWAM on both simulated environments and a real-world dual-arm robot (ARX arms with parallel grippers, using a RealSense D455 camera). The real-world tasks included a "Shell Game" where the robot must track a cup as it is swapped between positions, and a long-horizon pick-and-place sequence requiring memory of object locations.

Figure: MemoryWAM hybrid memory system with sliding window, anchor frames, and gist tokens

Key Results

MemoryWAM outperformed all baselines on memory-dependent manipulation tasks, with lower inference latency and GPU memory usage than the full-history baseline.

Simulation experiments: Policies using only a short observation window (no memory) failed on tasks requiring recall of events more than a few timesteps back. MemoryWAM solved these tasks reliably.
Real-world Shell Game: The robot had to track a cup as it was swapped at irregular intervals. The "LingBot-VA" baseline, which uses full history, had inference latency so high that it physically missed the cup swaps during execution, causing task failure. MemoryWAM succeeded with substantially lower latency.
GPU memory cost: MemoryWAM used significantly less GPU memory than the full-history LingBot-VA baseline, because it never stores every past frame.
Inference latency: The paper identifies LingBot-VA's high latency as a critical failure mode in real-world execution. MemoryWAM's hybrid approach kept latency low enough for real-time control.

The consistent trend across both simulation and real-world tests: memory is essential for non-Markovian tasks, but storing full history is inefficient. MemoryWAM's compressed persistent memory provides the best of both worlds.

How It Works

MemoryWAM separates the robot's understanding of the world (dynamics) from its action generation. The Video DiT extracts features from each new observation and updates a persistent KV cache. This cache stores three types of memory:

Sliding window: The most recent 4–8 frames for short-term temporal continuity.
Anchor frames: Selected frames at key moments (e.g., when a hand grasps an object) that are preserved indefinitely at low resolution.
Gist tokens: A learned compressed representation of everything else, produced by passing the Video DiT's intermediate features through a small transformer that outputs a fixed number of tokens (e.g., 8 or 16).
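
The three memory types above can be sketched as a toy data structure. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name HybridMemory, the cap on anchors (added here so the total stays bounded), and the running-mean "compressor" are all hypothetical stand-ins. In MemoryWAM the gist tokens come from a learned transformer, and the stored items are feature tokens rather than single floats.

```python
from collections import deque


class HybridMemory:
    """Toy fixed-budget memory: sliding window, anchor frames, gist slots.

    All sizes and update rules are illustrative assumptions.
    """

    def __init__(self, window_size=8, max_anchors=4, num_gist=16):
        self.window_size = window_size
        self.max_anchors = max_anchors  # hypothetical cap to keep the budget fixed
        self.num_gist = num_gist
        self.window = deque(maxlen=window_size)
        self.anchors = []
        self.gist = [0.0] * num_gist
        self.evicted = 0

    def add(self, frame_feature, is_anchor=False):
        # A frame about to fall out of the window is folded into the gist.
        if len(self.window) == self.window_size:
            self._fold_into_gist(self.window[0])
        self.window.append(frame_feature)
        if is_anchor:
            self.anchors.append(frame_feature)
            # Stand-in eviction rule: drop the oldest anchor when over budget.
            if len(self.anchors) > self.max_anchors:
                self.anchors.pop(0)

    def _fold_into_gist(self, feature):
        # Placeholder for the learned compressor: a running mean per slot.
        slot = self.evicted % self.num_gist
        count = self.evicted // self.num_gist + 1
        self.gist[slot] += (feature - self.gist[slot]) / count
        self.evicted += 1

    def size(self):
        """Number of stored entries, bounded regardless of episode length."""
        return len(self.window) + len(self.anchors) + len(self.gist)


memory = HybridMemory()
for t in range(1000):
    memory.add(float(t), is_anchor=(t % 100 == 0))
print(memory.size())  # 28 with the default sizes (8 + 4 + 16)
```

However long the episode runs, the entry count stays at or below window_size + max_anchors + num_gist, which is the property the fixed memory budget relies on.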

Figure: Real-world dual-arm robot setup with ARX arms and RealSense camera for MemoryWAM experiments

During inference, the Action DiT predicts a chunk of future actions by denoising random action tokens. It attends to the cached video representations via cross-attention, so it can "see" both current and past context. The key innovation is that the Video DiT only processes the current frame to update memory; it never re-encodes past frames.
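
The cross-attention step can be illustrated with a single-head scaled dot-product attention written in plain Python. This is a generic sketch of the mechanism, not the Action DiT itself: the token values are made up, and a real model would apply learned query, key, and value projections and run this inside a multi-step denoising loop.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists.

    queries: action tokens, shape [n_q][d]
    keys, values: cached memory tokens, shape [n_kv][d]
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical cached context: tokens from the window, anchors, and gist,
# concatenated into one key/value list (no learned projections here).
memory_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
action_tokens = [[0.5, -0.5], [2.0, 2.0]]  # stand-ins for noisy action tokens
context = cross_attention(action_tokens, memory_tokens, memory_tokens)
print(context)
```

Each output row is a weighted average of the memory tokens, so the action tokens read from past context without any past frame being re-encoded.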

The system processes a single new observation, updates the cache in one forward pass, and then samples actions. This is fundamentally different from approaches that stack all past observations and run the entire stack through a vision model each step.
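
The difference in per-step work can be made concrete by counting encoder calls. The sketch below is an assumption-laden toy: CountingEncoder, run_incremental, and run_restack are hypothetical names, the encoder is a pass-through, and the cache keeps only a recent window (no anchors or gist) to stay short.

```python
class CountingEncoder:
    """Stand-in for a vision model that counts how many frames it encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, frame):
        self.calls += 1
        return frame  # a real encoder would return features


def run_incremental(frames, window=4):
    """Encode each new frame once and keep features in a bounded cache."""
    encoder, cache = CountingEncoder(), []
    for frame in frames:
        cache.append(encoder.encode(frame))
        cache = cache[-window:]  # simplified: window only, no anchors or gist
        # ... action sampling would read `cache` here ...
    return encoder.calls


def run_restack(frames):
    """Re-encode the whole observation history at every control step."""
    encoder = CountingEncoder()
    for t in range(1, len(frames) + 1):
        features = [encoder.encode(f) for f in frames[:t]]
        # ... action sampling would read `features` here ...
    return encoder.calls


frames = list(range(100))
print(run_incremental(frames), run_restack(frames))  # 100 vs 5050
```

Over T steps the incremental loop encodes T frames in total, while restacking encodes T(T+1)/2, which is why per-step latency grows with episode length in the full-history case.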

Benchmark highlights (qualitative summary):

Task or metric	Memoryless baseline	Full-history baseline	MemoryWAM
Shell Game (real)	Failed (no cup recall)	Failed (latency too high)	Success
Long-horizon pick-and-place (sim)	Failed after ~20 steps	Worked but high latency	Success, low latency
GPU memory footprint	Low (but fails)	High (grows linearly)	Low and constant

No exact numerical table was provided in the paper text, but the pattern is clear: MemoryWAM solves memory-dependent tasks with practical compute.

Why This Matters for Robotics

Many real-world robot tasks—like assembly, cooking, or warehouse sorting—require remembering what happened minutes ago. Current state-of-the-art vision-language-action models (VLAs) often assume the environment is Markovian (i.e., only the latest image matters), which breaks down when objects disappear behind obstacles, tools are used and set down, or sequences have dependencies separated in time.

MemoryWAM's approach is especially relevant for humanoid robots and warehouse robots that operate in complex, dynamic environments. A humanoid that can remember where it put a tool ten minutes ago doesn't need to re-scan the environment constantly. A warehouse robot that can track inventory handoffs across multiple stations benefits from persistent memory without exploding compute costs.

The practical inference speed means MemoryWAM can run on current-generation GPUs in real time, which could make it deployable on used industrial robots retrofitted with modern controllers. For companies running used cobots for assembly tasks with long sequences, this memory-efficient architecture could enable automation of tasks that previously required human oversight.

Limitations and Open Questions

MemoryWAM inherits the fundamental limitations of video diffusion models: they struggle with high-level semantic reasoning and abstract task planning. The paper suggests that future work could combine MemoryWAM's memory system with a "System 2" reasoning model (such as a large language model) to handle tasks requiring logic, math, or natural language understanding.

Another open question is scalability: how well does the gist token compression work for tasks lasting hours or days? The experiments covered minutes-long tasks. The anchor frame selection policy (when to save an anchor) is hardcoded; learning this selection online could improve generalisation.
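
To show what a hardcoded anchor selection policy might look like, here is a minimal sketch. The rule itself is an assumption made for illustration (the article does not state the paper's actual rule): save an anchor when the gripper state flips, or when a hypothetical max_gap number of steps has passed since the last anchor.

```python
def select_anchor_steps(gripper_closed, max_gap=50):
    """Return step indices at which a hand-written rule would save an anchor.

    Rule (hypothetical): save when the gripper state flips, or when
    `max_gap` steps have passed since the last anchor.
    """
    anchors = []
    last_anchor = None
    for t, closed in enumerate(gripper_closed):
        flipped = t > 0 and closed != gripper_closed[t - 1]
        overdue = last_anchor is None or t - last_anchor >= max_gap
        if flipped or overdue:
            anchors.append(t)
            last_anchor = t
    return anchors


# Toy trace: gripper open for 30 steps, closed for 80, open for 20.
trace = [False] * 30 + [True] * 80 + [False] * 20
print(select_anchor_steps(trace))  # [0, 30, 80, 110]
```

A learned policy would replace the flipped and overdue tests with a model that scores each frame, which is the direction the open question points to.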

Finally, MemoryWAM was tested only on a single dual-arm platform with parallel grippers. Deploying on different robot morphologies or with dexterous hands may require retuning the memory configuration.

Frequently Asked Questions

What makes MemoryWAM different from earlier world action models? Earlier models either had no memory (failing on long-horizon tasks) or stored every past frame (becoming slow and memory-heavy). MemoryWAM uses a hybrid approach with a fixed-size memory that compresses history into anchor frames and gist tokens.

Does MemoryWAM require special hardware? No. It ran on standard GPUs in the experiments. The memory design is software-only and should in principle work with robots that use camera images and joint-level action outputs, although other morphologies may require retuning the memory configuration.

What tasks is MemoryWAM best suited for? Tasks where the robot must remember events that happened more than a few seconds ago, such as object tracking (Shell Game), multi-step assembly with occluded items, or long pick-and-place sequences.

Can MemoryWAM be combined with a language model for instruction following? The paper mentions that as future work. The current model accepts a task description as conditioning, but does not integrate a separate language reasoning loop.

Conclusion

MemoryWAM addresses a critical bottleneck in long-horizon robotic manipulation: how to remember the past without paying the full computational price. By combining a sliding window, anchor frames, and compressed gist tokens, it outperforms the baselines on memory-dependent tasks at real-time inference speeds. This brings world action models one step closer to practical deployment in factories and homes.