The Multimodal Mind
Perceive → Imagine → Act

The next frontier in AI is a single model that understands the physical world well enough to imagine its futures and act to shape them. Here is how I'd build one — the cognitive loop, the architecture, and the staged training recipe.

Ch. 01 The thesis

The next frontier in AI is a single model that understands the physical world well enough to imagine its futures and act to shape them. The path runs through a natively multimodal foundation model — one that perceives, reasons, and generates across all modalities (image, video, text, speech, action) — organized around a three-stage cognitive loop: Perceive → Imagine → Act.

Biological intelligence follows that loop without thinking about it: observe the world, mentally simulate what should happen, then intervene. Current AI breaks the loop into disconnected pipelines — a perception model here, a planner there, a controller bolted on the end, each trained by a different team on a different dataset. I propose building it as one model.

Every major AI capability challenge today — controllable video, computer use, robotics, world simulation, autonomous driving, creative tools — is a special case of multimodal conditional generation. One architecture subsumes them all. Improvements to the shared backbone benefit every task at once.

The rest of this page makes that claim concrete. We start with the loop, then the architecture that runs it (a Mixture of Transformers), then the part I want to expand on most: a staged pretraining recipe that grows the model one generative capability at a time, freezing what came before. We finish with the self-play flywheel and the unification of today's siloed problems.


Ch. 02 The cognitive loop

Three stages, run in a cycle. Each is inference in the same unified multimodal space — not three separate modules wired together.

  1. Perceive. The model ingests the current state across all modalities — video, audio, proprioception, language — into a unified representation. Perception is inference, not a preprocessing step.
  2. Imagine. Before acting, the model thinks in multimodal space. Not just text chain-of-thought, but a mental video of the planned trajectory, along with predicted proprioception and anticipated contact forces. This multimodal imagination is the world model.
  3. Act. The model generates whatever output the task requires — motor commands, keyboard/mouse, speech — conditioned on the imagined plan, not on reactive perception alone. The action changes the world, new percepts arrive, and the loop repeats.

Click a stage below to step through the loop, with the canonical example: a model deciding to pick up a glass.

Fig. 2.1 The Perceive → Imagine → Act loop. Click a node or step through it.

The key idea hiding in the animation: Imagine and Act use the same generative machinery as Perceive. Imagining "the hand approaching, fingers closing, the weight shifting" is video + proprioception generation. Acting is action generation. They are not bolted-on heads — they are the same model generating in different output modalities. That is what the next chapter's architecture buys us.
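
To make the "same machinery" point concrete, here is a minimal toy sketch of the loop. Everything in it is an assumption for illustration: `ToyWorld`, `ToyModel`, and the single `generate` method are hypothetical names, and the "world" is just a hand sliding along a line toward a glass. The only thing it is meant to show is that Perceive, Imagine, and Act are three calls into one model that differ in the output modality requested.

```python
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical 1-D world: a hand at `hand` must reach a glass at `glass`."""
    hand: float = 0.0
    glass: float = 3.0

    def observe(self) -> dict:
        return {"hand": self.hand, "glass": self.glass}

    def step(self, action: float) -> None:
        self.hand += action


class ToyModel:
    """Stand-in for the unified model: one `generate` method, several output modalities."""

    def generate(self, context: dict, modality: str):
        if modality == "state":      # Perceive: infer a unified state from the percept
            return {"gap": context["glass"] - context["hand"]}
        if modality == "rollout":    # Imagine: simulate closing the gap, record the moves
            gap, plan = context["gap"], []
            while abs(gap) > 1e-9:
                move = max(-1.0, min(1.0, gap))  # at most one unit per step
                gap -= move
                plan.append(move)
            return plan
        if modality == "action":     # Act: emit the first move of the imagined plan
            return context["plan"][0] if context["plan"] else 0.0
        raise ValueError(f"unknown modality: {modality}")


def run_loop(world: ToyWorld, model: ToyModel, max_steps: int = 10) -> int:
    """Run Perceive -> Imagine -> Act until the imagined plan is empty."""
    for step in range(max_steps):
        state = model.generate(world.observe(), "state")       # Perceive
        plan = model.generate(state, "rollout")                # Imagine
        if not plan:
            return step                                        # goal reached
        world.step(model.generate({"plan": plan}, "action"))   # Act
    return max_steps


world = ToyWorld()
steps = run_loop(world, ToyModel())
print(steps, world.hand)
```

Note that the loop re-plans from fresh percepts on every step rather than executing the whole imagined rollout blindly, which is the "new percepts arrive, and the loop repeats" part of the cycle.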


Ch. 03 One architecture: Mixture of Transformers

If one model has to perceive and generate images, video, audio, and actions, a single dense network creates a problem: modalities interfere. The statistics of a video frame and a motor command have almost nothing in common, and forcing the same feed-forward weights to model both means each capability dilutes the other.

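The architecture I use is a Mixture of Transformers (MoT). The trick is to split the network by modality: each modality gets its own expert holding every weight its tokens touch (Q/K/V projections, output projection, FFN, and norms), and the only shared component is a global causal self-attention operation, which has no weights of its own and lets every token attend to all preceding tokens regardless of modality.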

The payoff shows up in the next chapter: because every weight a token touches lives in its modality's expert, you can freeze an entire expert — its attention projections included — while training another, and global attention still lets the new expert read everything the frozen ones know.

Toggle modalities below to see how a mixed-modality sequence flows through shared attention and out to its experts.

Fig. 3.1 MoT forward pass: shared attention, per-modality experts. Toggle the modalities present in the sequence.
Input sequence (tokens, interleaved)
↓   per-modality Q / K / V projection   ↓
Global Causal Self-Attention  ·  each token attends to all preceding tokens (no weights of its own)
↓   per-modality output projection · FFN · norms   ↓

Notice that the attention operation never changes. The only thing that varies is which modality experts (and therefore which Q/K/V, output, FFN, and norm weights) are present and active. Drop video and audio and you have a vision-language understanding model. Add the action expert and the same backbone becomes an agent. One model, reconfigured by swapping experts in and out.
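
The forward pass described above can be sketched in a few dozen lines of plain Python. This is a toy under stated assumptions, not the actual implementation: the width `D`, the `Expert` class, `mot_layer`, the random weights, the single attention head, and the residual connection are all hypothetical, and the FFN and norms are left out for brevity. What it does show is the routing: each token is projected by its own modality's weights, while the attention step itself is shared and parameter-free.

```python
import math
import random

D = 4  # hypothetical model width


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class Expert:
    """The weights one modality's tokens touch: Q/K/V and output projections.
    (FFN and norms are omitted in this sketch.)"""

    def __init__(self, rng):
        self.wq, self.wk, self.wv, self.wo = (rand_matrix(rng, D, D) for _ in range(4))


def global_causal_attention(qs, ks, vs):
    """Parameter-free: token i attends to tokens 0..i, whatever their modality."""
    out = []
    for i, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(D) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w / total * vs[j][d] for j, w in enumerate(weights)) for d in range(D)])
    return out


def mot_layer(tokens, experts):
    """tokens: list of (modality, vector). Projections are routed by modality;
    the attention operation is shared by every token."""
    qs = [matvec(experts[m].wq, x) for m, x in tokens]
    ks = [matvec(experts[m].wk, x) for m, x in tokens]
    vs = [matvec(experts[m].wv, x) for m, x in tokens]
    mixed = global_causal_attention(qs, ks, vs)
    # Residual connection plus each token's own expert output projection.
    return [[xi + oi for xi, oi in zip(x, matvec(experts[m].wo, a))]
            for (m, x), a in zip(tokens, mixed)]


rng = random.Random(0)
experts = {"text": Expert(rng), "image": Expert(rng), "action": Expert(rng)}
tokens = [("text", [1.0, 0.0, 0.0, 0.0]),
          ("image", [0.0, 1.0, 0.0, 0.0]),
          ("action", [0.0, 0.0, 1.0, 0.0])]
out = mot_layer(tokens, experts)
print(len(out), len(out[0]))
```

Removing the `"action"` entry from `experts` (and action tokens from the sequence) leaves `mot_layer` and `global_causal_attention` untouched, which is the sense in which the model is reconfigured by its set of experts.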


Ch. 04 The pretraining recipe: a staged curriculum

The recipe mirrors how we train LLMs — next-token prediction at internet scale — but extends every stage into multimodal space, and crucially, it is staged. You do not learn to perceive, imagine, and act all at once. Generation is harder than understanding, and each generative capability builds on the one before it. So pretraining climbs a ladder:

MM understanding → image generation → video / audio generation → action generation

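Two rules govern the climb: each stage adds and trains exactly one new expert, and every expert trained in an earlier stage stays frozen.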

There is a third reason the recipe is staged, and it may be the most practical one: data gets scarcer at every stage. The corpus of text, images, video, and audio for understanding is internet-scale. Paired data for high-quality image generation is smaller; video and audio generation smaller still; and action data — robot trajectories, labeled screen recordings, driving logs — is the scarcest of all. Each stage trains on roughly an order of magnitude less data than the one before. So we spend the abundant data first to build the strongest possible backbone, then climb into the data-poor regimes on top of it.

Step through the four stages below. Watch which experts are training, which are frozen, and which haven't been added yet. The shared attention is along for the ride the whole time.

Fig. 4.1 The staged pretraining curriculum. Each stage trains one new expert and freezes the rest.
Global Causal Self-Attention  ·  the weightless operation that lets each token attend to all preceding tokens, across every active expert
Legend: training · frozen · not yet added

Why this order?

The sequence is not arbitrary — each arrow is a dependency:

Why freeze?

Freezing earlier components does three things at once:

Because every weight, attention projections included, lives inside a modality expert, freezing an expert freezes its Q/K/V too. The global attention operation has no parameters of its own; it simply lets the tokens of each newly added expert attend to the keys and values that the frozen experts produce.
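
As a rough sketch of the bookkeeping this implies, the snippet below tracks which expert is training, frozen, or not yet added at each stage, and selects trainable parameters by expert name. The stage identifiers, the `expert/weight` naming scheme, and both helper functions are hypothetical; a real training setup would apply the same selection to its optimizer's parameter groups.

```python
# Stage order follows the ladder in the text; the identifier spellings are assumptions.
STAGES = [
    "understanding",
    "image_generation",
    "video_audio_generation",
    "action_generation",
]


def expert_status(stage_index: int) -> dict:
    """Status of every expert while stage `stage_index` (0-based) is running."""
    if not 0 <= stage_index < len(STAGES):
        raise ValueError(f"stage_index must be in 0..{len(STAGES) - 1}")
    status = {}
    for i, name in enumerate(STAGES):
        if i < stage_index:
            status[name] = "frozen"
        elif i == stage_index:
            status[name] = "training"
        else:
            status[name] = "not yet added"
    return status


def trainable_parameters(params: dict, stage_index: int) -> list:
    """Keep only parameters owned by the expert being trained.

    `params` maps hypothetical 'expert/weight' names to values. Because Q/K/V
    live inside each expert, they are frozen along with everything else.
    """
    active = STAGES[stage_index]
    return [name for name in params if name.split("/")[0] == active]


params = {
    "understanding/wq": 0.0,
    "understanding/ffn": 0.0,
    "image_generation/wq": 0.0,
    "image_generation/ffn": 0.0,
}
print(expert_status(1))
print(trainable_parameters(params, 1))
```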

The learning-rate schedule

All of this rides on a single learning-rate schedule with three phases: a short warm-up that ramps the LR to its peak, a long constant-LR (CLR) plateau that does the bulk of the work, and a ramp-down that anneals the LR toward zero. The staged curriculum maps directly onto it. The first 50% of the CLR plateau is understanding only — the data-rich foundation — and the generation stages (image, then video/audio, then action) are layered into the second half. The ramp-down is reserved for high-quality data across every modality, polishing the whole stack as the LR decays to zero.
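
A minimal sketch of that schedule, with the data stage laid over it, might look like the following. Several numbers here are assumptions, since the text does not give them: the warm-up and ramp-down fractions (2% and 20% of the run), the linear shape of both ramps, the equal three-way split of the plateau's second half among the generation stages, and the choice to feed understanding data during warm-up. Only the 50% split of the plateau and the ordering come from the description above.

```python
def lr_fraction(progress: float, warmup: float = 0.02, rampdown: float = 0.20) -> float:
    """Learning rate as a fraction of peak, for progress in [0, 1].

    `warmup` and `rampdown` are assumed fractions of the whole run.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    plateau_end = 1.0 - rampdown
    if progress < warmup:
        return progress / warmup           # linear warm-up to peak (assumed shape)
    if progress <= plateau_end:
        return 1.0                         # constant-LR plateau
    return (1.0 - progress) / rampdown     # linear anneal to zero (assumed shape)


def data_stage(progress: float, warmup: float = 0.02, rampdown: float = 0.20) -> str:
    """Which data is being trained on at a given point in the run."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    plateau_end = 1.0 - rampdown
    if progress > plateau_end:
        return "high-quality data, all modalities"
    # Position within the plateau; warm-up is clamped to 0 (assumed: understanding data).
    pos = max(0.0, (progress - warmup) / (plateau_end - warmup))
    if pos < 0.5:
        return "understanding"
    generation = ["image generation", "video/audio generation", "action generation"]
    # Second half of the plateau split into equal thirds (an assumption).
    return generation[min(2, int((pos - 0.5) / 0.5 * 3))]


for p in (0.01, 0.30, 0.50, 0.60, 0.75, 0.90):
    print(f"{p:.2f}  lr={lr_fraction(p):.2f}  data={data_stage(p)}")
```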

Scrub through a full pretraining run below — or hit play. The shaded bands show which data is mixed in; the readout tracks the LR, the data, and which expert is learning at that point.

Fig. 4.2 The pretraining LR schedule, with the staged data mix laid over it.

The staged-and-frozen recipe turns one terrifying joint-training problem into four tractable ones. Each stage asks a single, well-posed question — given everything I already know, how do I generate this one new modality? — and answers it without disturbing the rest.

Ch. 05 Post-training & the physics flywheel

Pretraining learns the prior over how the multimodal world works. Two stages turn that prior into a capable agent:

In simulation, the model runs its full loop and grades itself automatically on three questions:

Trace one lap to see why it compounds. A better world model makes the model's imagination more accurate, so it can plan against a mental simulation it trusts — better plans. Better plans yield better actions, which succeed more often. Those successful rollouts ("this plan led to this outcome") are exactly the labeled examples you want, so they become richer training data — which trains an even better world model. Each lap raises the floor for the next.

Text models can self-improve this way too — but only where an automatic grader exists. Math has checkable answers, code has unit tests, and RL from verifiable rewards has driven much of the recent progress in exactly those domains. Two things limit it. First, breadth: verifiable graders cover a narrow slice of what we care about, and everything else falls back on learned proxies — reward models, LLM judges — which are themselves models that can be reward-hacked and carry no ground truth about the physical world. Second, what the signal improves: a unit test sharpens a policy, but it doesn't teach the model how the world works. Physics fixes both. It is a broad, grounded grader across the entire embodied space — did the glass actually lift? did the motion obey gravity? — and that signal flows back into the world model itself. Physics is to physical reality what unit tests are to code.

The catch is that the whole loop runs inside a simulation, so its ceiling is simulation fidelity. Train against a simulator that gets physics wrong and the model learns from a lie: the rewards are noise and nothing compounds (exactly what the fidelity slider below shows). And neither obvious option is enough on its own: a hand-built physics engine is accurate but can't render the visual richness of the real world, while real video is realistic but you can't act inside it. The essential ingredient is therefore a learned world simulator, trained on real video for realism and interactivity, and corrected by physics engines for physical consistency.

Drag the fidelity slider to see how it gates the compounding rate.

Fig. 5.1 The self-play flywheel. Simulation fidelity gates how fast capability compounds. (Interactive plot: relative capability against self-play cycles, controlled by a simulation-fidelity slider.)

Low fidelity and the flywheel slips: the model learns from a simulation that lies about physics, so capability barely compounds. High fidelity and each cycle multiplies the last. The whole strategy lives or dies on the quality of the imagined world.
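
One way to picture that gating is a toy compounding model. The functional form below is purely an assumption made for illustration, not something measured or taken from the figure: each self-play cycle multiplies capability by `1 + gain`, and `gain` shrinks steeply as fidelity drops. The parameters `max_gain` and `sharpness` are hypothetical knobs.

```python
def capability_after(cycles: int, fidelity: float,
                     max_gain: float = 0.10, sharpness: float = 4.0) -> float:
    """Relative capability after `cycles` self-play laps (1.0 = starting point).

    Toy assumption: per-cycle gain = max_gain * fidelity ** sharpness, so a
    simulator that is badly wrong contributes almost no usable signal.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be in [0, 1]")
    gain = max_gain * fidelity ** sharpness
    return (1.0 + gain) ** cycles


for f in (0.2, 0.6, 1.0):
    print(f"fidelity={f:.1f}  capability after 20 cycles = {capability_after(20, f):.2f}x")
```

Under these assumptions the gap between low and high fidelity widens with every cycle, because the difference sits in the exponent's base rather than in a one-off offset.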


Ch. 06 One model, many problems

Here is the payoff of building it as one model. Every major AI capability challenge is a different route through the same architecture — a particular choice of input modalities and output modality. Pick inputs and an output below to see which of today's siloed research problems you've just described. Familiar combinations name an established task; the rest are tagged potential application — routes the same model could serve, with no household name yet.

Fig. 6.1 Multimodal conditional generation. Choose inputs and an output; read off the task.
Inputs (Perceive) — choose one or more
Output (Act) — choose one
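
The "choose inputs and an output" idea amounts to a lookup keyed on a set of input modalities and one output modality. The sketch below is illustrative only: the table entries are my own examples of commonly named tasks, not the figure's actual contents, and `TASKS` and `name_task` are hypothetical names. Combinations without an entry fall back to the "potential application" tag described above.

```python
# Modalities listed in the thesis chapter.
MODALITIES = {"image", "video", "text", "speech", "action"}

# Illustrative entries only (assumed, not taken from the figure).
TASKS = {
    (frozenset({"text"}), "image"): "text-to-image generation",
    (frozenset({"image"}), "text"): "image captioning",
    (frozenset({"image", "text"}), "text"): "visual question answering",
    (frozenset({"speech"}), "text"): "speech recognition",
    (frozenset({"text"}), "speech"): "text-to-speech",
    (frozenset({"text"}), "video"): "text-to-video generation",
    (frozenset({"video", "text"}), "action"): "instruction-following robot control",
}


def name_task(inputs, output: str) -> str:
    """Name the task described by a set of input modalities and one output modality."""
    inputs = frozenset(inputs)
    if not inputs:
        raise ValueError("choose at least one input modality")
    unknown = (inputs | {output}) - MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return TASKS.get((inputs, output), "potential application")


print(name_task({"text"}, "image"))
print(name_task({"image", "text"}, "text"))
print(name_task({"speech", "action"}, "video"))
```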

Today each of these is a separate model, dataset, and research community. The Perceive → Imagine → Act framework unifies them as different routes through one architecture — and improvements to the shared backbone benefit every task simultaneously. That is the whole bet: build the multimodal mind once, and every capability comes along for the ride.