January 31, 2026 Research

Dreamer-MC: A Real-Time Autoregressive World Model for Infinite Video Generation

Ming Gao*
Yan Yan
ShengQu Xi
Yu Duan
ShengQian Li
Feng Wang†
Create AI

A deep dive into the architecture behind our infinite video engine, achieving coherent open-world simulation at 20 FPS with sub-50 ms latency and a 256-frame context on a single H200 GPU.

Figure 1: Infinite procedural generation from the Dreamer-MC prior.

We are thrilled to present our latest work: a real-time, infinite video generation model for Minecraft, developed as an adaptation of the DreamerV4 architecture. By leveraging this powerful foundation, we have achieved true autoregressive generation that streams indefinitely, efficiently managing temporal context to avoid the redundant re-computation of past frames. This system goes far beyond simple navigation; it captures the rich, emergent complexity of the Minecraft sandbox, seamlessly simulating intricate interactions—from the physics of archery and horse riding to the granular details of mining and consuming items.

Key Technical Innovations

Comparison of generation with and without x0 prediction
Figure 2: Without x0 prediction, autoregressive models that maintain constant-time inference (i.e., never recompute history) suffer from severe "drift," with textures melting into noise over long horizons.

The Challenge of Infinite Generation: Conquering Error Accumulation

One of the most persistent hurdles in autoregressive video generation is error accumulation. In a standard autoregressive setup, each new frame is generated based on the previous one; consequently, minor imperfections in early frames compound over time, leading to "drift." As the generation horizon extends, this drift manifests as severe geometric distortion, blurring, and significant color shifts, eventually causing the simulation to collapse into unintelligible noise.

Previous approaches have attempted to mitigate this with computationally expensive workarounds. Methods like Context as Memory effectively reduce drift by maintaining a bank of historical frames, but they often require re-computing or re-attending to past latents, which breaks the promise of constant-time inference. Other state-of-the-art attempts—such as Self-Forcing or Rolling Forcing—try to bridge the gap between training and inference. These methods typically involve fine-tuning a bidirectional model with "diffusion forcing" and then performing autoregressive distillation using Distribution Matching Distillation (DMD). While these techniques delay the onset of degradation, they are not immune to it; once the generation length significantly exceeds the training window, the distribution inevitably shifts, resulting in the familiar artifacts of "melting" structures and washed-out colors.

Our model solves this fundamentally by adopting the \( x_0 \) prediction objective introduced in DreamerV4. Instead of predicting the noise residual (\( \epsilon \)) at every step—which allows errors to accumulate in the latent space—the model is trained to predict the clean, original image state (\( x_0 \)) directly. This effectively "resets" the noise floor at each step, preventing microscopic errors from compounding. This architectural choice is the key to our model's ability to sustain crisp, coherent gameplay visuals indefinitely, without the need for expensive history re-computation.
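To make the distinction concrete, here is a minimal PyTorch sketch of the two objectives under a standard DDPM-style forward process; the network interface, noise schedule, and tensor shapes are illustrative assumptions, not our training code.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, t, alphas_cumprod, predict_x0=True):
    """One training step of a denoising objective.

    x0:             clean latent frames, shape (B, C, H, W)
    t:              integer timesteps, shape (B,)
    alphas_cumprod: cumulative noise schedule, shape (T,)
    """
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    pred = model(x_t, t)
    if predict_x0:
        # x0-prediction: regress the clean frame directly, so the target at
        # every step is a valid image state rather than a noise residual.
        return F.mse_loss(pred, x0)
    # eps-prediction: regress the injected noise (the conventional objective),
    # which lets small latent errors compound across autoregressive steps.
    return F.mse_loss(pred, noise)
```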

Model Architecture

Unlike traditional video transformers that operate on VAE latents, our model learns a compact, semantic latent representation of the world. This allows us to separate the visual compression from the dynamics learning, achieving high-resolution output with manageable computational costs.

Causal Gather-Token Temporal Autoencoder
Figure 3: Causal Gather-Token Temporal Autoencoder. Learnable tokens aggregate information from image patches across time.
Dynamic model architecture
Figure 4: Dynamic model architecture with decomposed spatial-temporal attention.

1. Causal Gather-Token Tokenizer

To achieve a high-level semantic understanding of the Minecraft world, we designed a Causal Gather-Token Temporal Autoencoder (Figure 3). Instead of processing the entire image grid directly, we introduce a set of learnable Gather Tokens (\( G \)). These tokens act as the primary information bottleneck; they query and aggregate features from the masked image patches via cross-attention, compressing the visual input into a compact latent representation \( z \).

Crucially, to ensure stability in video generation, frames are not encoded in isolation. We incorporate Temporal Attention layers that allow each frame's gather tokens to attend to the gather tokens of previous frames. This causal temporal link keeps the latent representation consistent over time, significantly reducing flickering and ensuring that the model's semantic understanding of the world evolves smoothly.
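The sketch below illustrates this encoder pattern as described above; the module choices (nn.MultiheadAttention), dimensions, and token counts are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class GatherTokenEncoder(nn.Module):
    """Illustrative encoder: learnable gather tokens cross-attend to the
    patch features of their frame, then attend causally to the gather
    tokens of earlier frames."""

    def __init__(self, dim=512, num_gather=16, num_heads=8):
        super().__init__()
        self.gather = nn.Parameter(torch.randn(num_gather, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):
        # patch_feats: (B, T, N, D) per-frame patch embeddings
        B, T, N, D = patch_feats.shape
        g = self.gather.expand(B * T, -1, -1)                  # (B*T, G, D)

        # 1) Spatial aggregation: gather tokens query the frame's patches.
        patches = patch_feats.reshape(B * T, N, D)
        z, _ = self.cross_attn(g, patches, patches)            # (B*T, G, D)
        G = z.shape[1]
        z = z.reshape(B, T * G, D)

        # 2) Causal temporal attention: gather tokens of frame t may attend
        #    to all gather tokens of frames <= t (block-causal mask).
        frame_id = torch.arange(T, device=z.device).repeat_interleave(G)
        block_causal = frame_id[None, :] > frame_id[:, None]   # True = masked
        z, _ = self.temporal_attn(z, z, z, attn_mask=block_causal)
        return z.reshape(B, T, G, D)                           # latent z per frame
```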

2. Dynamic Model: Decomposed Spatial-Temporal Attention
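We do not reproduce the full block definition here. As a rough guide to what Figure 4 depicts, the sketch below shows a generic factorized block in which spatial self-attention mixes tokens within a frame and causal temporal attention mixes tokens across frames; all layer names, widths, and ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecomposedSTBlock(nn.Module):
    """Illustrative factorized block: spatial self-attention within each
    frame, then causal temporal attention across frames, then an MLP."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N latent tokens per frame
        B, T, N, D = x.shape

        # Spatial attention: tokens attend only within their own frame.
        xs = self.norm_s(x).reshape(B * T, N, D)
        xs, _ = self.spatial(xs, xs, xs)
        x = x + xs.reshape(B, T, N, D)

        # Causal temporal attention: each token attends to the same spatial
        # slot in the current and earlier frames.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        xt, _ = self.temporal(xt, xt, xt, attn_mask=causal)
        x = x + xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        return x + self.mlp(self.norm_m(x))
```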

Complex Interactions: Beyond Simple Navigation

Previous iterations of world models, including earlier Dreamer variants, were primarily benchmarked on simple locomotion tasks—essentially learning to "run" or "walk" through a static environment. While impressive, these models often struggled with the fine-grained manipulation and causal physics required for a true sandbox simulation.

Dreamer-MC significantly raises the bar by mastering a diverse array of complex, multi-step interactions. The model does not just predict camera movement; it accurately simulates the cause-and-effect of tool usage, navigation, and environmental modification.

01. Boating & Water Physics

02. Consuming Items

03. Archery & Projectiles

04. Fluid Dynamics (Bucket)

05. Mining & Breaking Blocks

06. Sleeping / Time Skip

07. Nether Portals & Teleportation

Inference Optimization

Achieving true real-time generation at high quality required moving beyond a standard eager-mode PyTorch implementation. We built a heavily optimized inference pipeline designed for long-context workloads on a single H200 GPU.

Ring Buffer Mechanism
Figure 5: Ring Buffer mechanism for infinite context. By re-applying RoPE embeddings dynamically during inference, we maintain temporal coherence without recomputing history.

1. Infinite Context via Ring Buffer & RoPE Re-indexing

Standard sliding-window implementations often rely on shifting memory (e.g., torch.roll or memmove) to discard old frames and make room for new ones. This data movement incurs a significant \( O(N) \) overhead at every step.

We solve this using a Ring Buffer KV Cache. Instead of shifting data, we maintain a fixed-size buffer and simply advance a write pointer, overwriting the oldest frame with the newest one (\( O(1) \) complexity). However, this introduces a new challenge: the physical position of a frame in the buffer no longer matches its logical temporal position, breaking standard Rotary Positional Embeddings (RoPE).

To address this, we implement RoPE Re-indexing: the rotary embeddings are re-applied dynamically at inference time using each frame's logical timestep rather than its physical slot in the buffer, so relative temporal positions remain correct without recomputing history.
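The sketch below illustrates the idea; the buffer layout, shapes, and the RoPE helper are illustrative assumptions rather than our production kernels.

```python
import torch

class RingBufferKVCache:
    """Illustrative fixed-size KV cache. New frames overwrite the oldest slot
    (O(1) per step) and logical timesteps are stored alongside, so RoPE can be
    applied from temporal position rather than buffer position."""

    def __init__(self, max_frames, tokens_per_frame, head_dim, device="cuda"):
        shape = (max_frames, tokens_per_frame, head_dim)
        self.k = torch.zeros(shape, device=device)
        self.v = torch.zeros(shape, device=device)
        # -1 marks empty slots; these should be masked out of attention.
        self.timestep = torch.full((max_frames,), -1, dtype=torch.long, device=device)
        self.write_ptr = 0
        self.max_frames = max_frames

    def append(self, k_new, v_new, t):
        # Overwrite the oldest frame in place -- no shifting of existing entries.
        self.k[self.write_ptr] = k_new
        self.v[self.write_ptr] = v_new
        self.timestep[self.write_ptr] = t
        self.write_ptr = (self.write_ptr + 1) % self.max_frames


def apply_rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by phases derived from logical positions.

    x:         (frames, tokens, head_dim), head_dim even
    positions: (frames,) logical frame indices
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    theta = positions.float().view(-1, 1, 1) * inv_freq        # (frames, 1, d/2)
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# At attention time, the query is rotated with the current logical timestep and
# the cached keys with their stored timesteps, so the physical slot order in
# the buffer is irrelevant:
#   q_rot = apply_rope(q, torch.tensor([t], device=q.device))
#   k_rot = apply_rope(cache.k, cache.timestep)
```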

2. Removing CPU Overhead with CUDA Graphs

In autoregressive generation, the model executes thousands of small GPU kernels (e.g., LayerNorm, Attention, MLP) per frame. In a standard Python loop, the CPU overhead of dispatching these kernels often exceeds the actual GPU execution time, creating a severe bottleneck.

To eliminate this, we utilize CUDA Graphs. We capture the entire inference step—including the Ring Buffer update and RoPE application—into a static execution graph. This allows the GPU to launch the entire sequence of kernels autonomously, bypassing the Python/CPU bottleneck entirely. This optimization reduced our frame generation latency by over 40%, enabling high-resolution generation at steady interactive framerates.
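The condensed sketch below shows the standard PyTorch capture-and-replay pattern we rely on; `inference_step` and the buffer shapes are placeholders for our actual per-frame computation.

```python
import torch

# Static tensors the captured graph reads from and writes to; their storage
# must stay fixed across replays, so new inputs are copied into them in place.
static_latent = torch.zeros(1, 16, 512, device="cuda")
static_action = torch.zeros(1, 32, device="cuda")

def inference_step(latent, action):
    # Placeholder for the real per-frame step (denoise, ring-buffer update,
    # RoPE application, decode). It must be shape-stable and free of CPU syncs.
    return latent + action.sum()

# Warm up on a side stream so one-time initialization is not baked into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        inference_step(static_latent, static_action)
torch.cuda.current_stream().wait_stream(s)

# Capture one full inference step into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = inference_step(static_latent, static_action)

def generate_frame(latent, action):
    # Copy fresh inputs into the static buffers, then replay the whole kernel
    # sequence with a single CPU-side launch.
    static_latent.copy_(latent)
    static_action.copy_(action)
    graph.replay()
    return static_out.clone()
```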

Limitations & Future Work

While our model demonstrates the feasibility of real-time, infinite video generation, it remains a stepping stone toward a truly general-purpose world model. Several limitations remain, and we plan to address them in future iterations.

Citations

@article{gao2026dreamermc,
    title   = {Dreamer-MC: A Real-Time Autoregressive World Model for Infinite Video Generation},
    author  = {Gao, Ming and Yan, Yan and Xi, ShengQu and Duan, Yu and Li, ShengQian and Wang, Feng},
    year    = {2026},
    url     = {https://findlamp.github.io/dreamer-mc.github.io/}
}

References

@article{hafner2025dreamerv4,
    title   = {Training Agents Inside of Scalable World Models},
    author  = {Hafner, Danijar and Yan, Wilson and Lillicrap, Timothy},
    journal = {arXiv preprint arXiv:2509.24527},
    year    = {2025},
    url     = {https://arxiv.org/abs/2509.24527}
}
@inproceedings{chen2025maetok,
    title     = {Masked Autoencoders Are Effective Tokenizers for Diffusion Models},
    author    = {Chen, Hao and Zhang, Michael and Li, Yuxin and others},
    booktitle = {International Conference on Machine Learning (ICML)},
    year      = {2025},
    url       = {https://arxiv.org/abs/2502.03444}
}
@article{huang2025selfforcing,
    title   = {Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
    author  = {Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
    journal = {arXiv preprint arXiv:2506.08009},
    year    = {2025},
    url     = {https://arxiv.org/abs/2506.08009}
}
@article{liu2025rolling,
    title   = {Rolling Forcing: Autoregressive Long Video Diffusion in Real Time},
    author  = {Liu, Kunhao and Hu, Wenbo and Xu, Jiale and Shan, Ying and Lu, Shijian},
    journal = {arXiv preprint arXiv:2509.25161},
    year    = {2025},
    url     = {https://arxiv.org/abs/2509.25161}
}
@article{yu2025context,
    title   = {Context as Memory: Scene-Consistent Interactive Long Video Generation},
    author  = {Yu, Jiwen and Bai, Jianhong and Qin, Yiran and Liu, Quande and others},
    journal = {arXiv preprint arXiv:2506.03141},
    year    = {2025},
    url     = {https://arxiv.org/abs/2506.03141}
}

Create AI Team

Contact: dujinshidai30@gmail.com
