The Multimodal Mind
Perceive → Imagine → Act
The next frontier in AI is a single model that understands the physical world well enough to imagine its futures and act to shape them. Here is how I'd build one — the cognitive loop, the architecture, and the staged training recipe.
Ch. 01 The thesis
The next frontier in AI is a single model that understands the physical world well enough to imagine its futures and act to shape them. The path runs through a natively multimodal foundation model — one that perceives, reasons, and generates across all modalities (image, video, text, speech, action) — organized around a three-stage cognitive loop: Perceive → Imagine → Act.
Biological intelligence follows that loop without thinking about it: observe the world, mentally simulate what would happen, then intervene. Current AI breaks the loop into disconnected pipelines: a perception model here, a planner there, and a controller bolted on the end, each trained by a different team on a different dataset. I propose building it as one model.
The rest of this page makes that claim concrete. We start with the loop, then the architecture that runs it (a Mixture of Transformers), then the part I want to expand on most: a staged pretraining recipe that grows the model one generative capability at a time, freezing what came before. We finish with the self-play flywheel and the unification of today's siloed problems.
Ch. 02 The cognitive loop
Three stages, run in a cycle. Each is inference in the same unified multimodal space — not three separate modules wired together.
- Perceive. The model ingests the current state across all modalities — video, audio, proprioception, language — into a unified representation. Perception is inference, not a preprocessing step.
- Imagine. Before acting, the model thinks in multimodal space. Not just text chain-of-thought, but a mental video of the planned trajectory, together with predicted proprioception and anticipated contact forces. This multimodal imagination is the world model.
- Act. The model generates whatever output the task requires — motor commands, keyboard/mouse, speech — conditioned on the imagined plan, not on reactive perception alone. The action changes the world, new percepts arrive, and the loop repeats.
Click a stage below to step through the loop, with the canonical example: a model deciding to pick up a glass.
The key idea hiding in the animation: Imagine and Act use the same generative machinery as Perceive. Imagining "the hand approaching, fingers closing, the weight shifting" is video + proprioception generation. Acting is action generation. They are not bolted-on heads — they are the same model generating in different output modalities. That is what the next chapter's architecture buys us.
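To make the cycle concrete, here is a minimal sketch of the loop on a toy one-dimensional world, where a hand moves toward a glass. Everything in it is an assumption made for illustration: the function names, the integer world, and the hand-written world model inside `imagine`. In the proposal above, all three steps would be inference in one model rather than three separate functions.

```python
def perceive(world):
    """Fold raw observations into one state (here: hand and glass positions)."""
    return {"hand": world["hand"], "glass": world["glass"]}

def imagine(state, action, horizon=3):
    """Toy world model: predicted hand positions if `action` is repeated."""
    position, trajectory = state["hand"], []
    for _ in range(horizon):
        position += action
        trajectory.append(position)
    return trajectory

def act(state, candidate_actions=(-1, 0, 1)):
    """Pick the action whose imagined trajectory passes closest to the glass."""
    def closest_approach(action):
        return min(abs(p - state["glass"]) for p in imagine(state, action))
    return min(candidate_actions, key=closest_approach)

def run_loop(world, max_steps=20):
    """Perceive, imagine, act, repeat. Returns the steps taken, or None."""
    for step in range(max_steps):
        state = perceive(world)
        if state["hand"] == state["glass"]:
            return step
        world["hand"] += act(state)  # the action changes the world
    return None

steps_needed = run_loop({"hand": 0, "glass": 5})
```

The point of the sketch is the control flow: `act` never looks at the raw percept alone, it always chooses by comparing imagined futures.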
Ch. 03 One architecture: Mixture of Transformers
If one model has to perceive and generate images, video, audio, and actions, a single dense network creates a problem: modalities interfere. The statistics of a video frame and a motor command have little in common, and forcing the same feed-forward weights to model both means each capability dilutes the other.
The architecture I use is a Mixture of Transformers (MoT). The trick is to split the network by modality:
- Every weight is modality-specific. Each modality gets its own copy of the entire transformer block — the query/key/value and output projections, the feed-forward network, and the layernorms. Those weights are the experts. A token's modality deterministically selects which expert processes it; there is no learned router.
- Only the attention operation is global. Each token's Q, K, and V are produced by its own modality's projections, but attention is then computed causally over the combined sequence — each token attends to every preceding token regardless of modality, so a later video token attends to earlier text and action tokens. The attention operation has no weights of its own; it is the wire that makes this one model, while every learnable parameter stays modality-specific.
The payoff shows up in the next chapter: because every weight a token touches lives in its modality's expert, you can freeze an entire expert — its attention projections included — while training another, and global attention still lets the new expert read everything the frozen ones know.
Toggle modalities below to see how a mixed-modality sequence flows through shared attention and out to its experts.
Notice that the attention operation never changes — only which modality experts (and therefore which Q/K/V, output, FFN, and norm weights) are present and active. Drop video and audio and you have a vision-language understanding model. Add the action expert and the same backbone becomes an agent. One model, reconfigured by which experts are present and active.
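The split between modality-specific weights and a weight-free global attention can be sketched in a few dozen lines of plain Python. This is an illustrative sketch, not the actual implementation: the single attention head, pre-norm residual layout, ReLU feed-forward network, tiny dimensions, and all names (`Expert`, `mot_block`) are assumptions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def layernorm(x, gain, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) for g, v in zip(gain, x)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class Expert:
    """Every learnable weight for one modality: norms, Q/K/V/output, FFN."""
    def __init__(self, rng, d):
        self.Wq = rand_matrix(rng, d, d)
        self.Wk = rand_matrix(rng, d, d)
        self.Wv = rand_matrix(rng, d, d)
        self.Wo = rand_matrix(rng, d, d)
        self.W1 = rand_matrix(rng, 4 * d, d)
        self.W2 = rand_matrix(rng, d, 4 * d)
        self.norm1 = [1.0] * d
        self.norm2 = [1.0] * d

def mot_block(tokens, modalities, experts):
    """One block over a mixed-modality sequence of d-dimensional tokens."""
    d = len(tokens[0])
    # Deterministic routing: the modality tag picks the expert, no router.
    ex = [experts[m] for m in modalities]
    normed = [layernorm(x, e.norm1) for x, e in zip(tokens, ex)]
    q = [matvec(e.Wq, x) for x, e in zip(normed, ex)]
    k = [matvec(e.Wk, x) for x, e in zip(normed, ex)]
    v = [matvec(e.Wv, x) for x, e in zip(normed, ex)]
    out = []
    for i, e in enumerate(ex):
        # Global causal attention: token i sees tokens 0..i of every modality.
        weights = softmax([sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                           for j in range(i + 1)])
        mixed = [sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(d)]
        h = [x + o for x, o in zip(tokens[i], matvec(e.Wo, mixed))]
        # Modality-specific feed-forward network with a residual connection.
        hidden = [max(0.0, u) for u in matvec(e.W1, layernorm(h, e.norm2))]
        out.append([a + b for a, b in zip(h, matvec(e.W2, hidden))])
    return out
```

Note that the attention step itself touches no weights: removing an entry from `experts` removes every parameter that modality owns, which is what makes whole-expert freezing possible.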
Ch. 04 The pretraining recipe: a staged curriculum
The recipe mirrors how we train LLMs (next-token prediction at internet scale) but extends every stage into multimodal space, and crucially, it is staged. You do not learn to perceive, imagine, and act all at once. Generation is harder than understanding, and each generative capability builds on the one before it. So pretraining climbs a ladder of four stages: understanding first, then image generation, then video and audio generation, then action.
Two rules govern the climb:
- Add one expert per stage. Each new stage introduces a new MoT generation expert (Ch. 3) on top of everything learned so far.
- Freeze everything earlier. When training a later generation component, the earlier components are frozen. Only the newest expert learns.
There is a third reason the recipe is staged, and it may be the most practical one: data gets scarcer at every stage. The corpus of text, images, video, and audio for understanding is internet-scale. Paired data for high-quality image generation is smaller; video and audio generation smaller still; and action data — robot trajectories, labeled screen recordings, driving logs — is the scarcest of all. Each stage trains on roughly an order of magnitude less data than the one before. So we spend the abundant data first to build the strongest possible backbone, then climb into the data-poor regimes on top of it.
Step through the four stages below. Watch which experts are training, which are frozen, and which haven't been added yet. The shared attention is along for the ride the whole time.
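The two rules reduce to a small lookup: for a given stage, everything earlier is frozen, the current entry is training, and everything later does not exist yet. The sketch below assumes the four-stage order described in this chapter; the identifier names are hypothetical, and treating video and audio as a single entry is a simplification.

```python
# Stage order as described in this chapter; the names are illustrative.
STAGES = ["understanding", "image_generation", "video_audio_generation", "action"]

def expert_status(stage_index):
    """Return the status of every component while training stage `stage_index`."""
    status = {}
    for i, name in enumerate(STAGES):
        if i < stage_index:
            status[name] = "frozen"         # rule 2: freeze everything earlier
        elif i == stage_index:
            status[name] = "training"       # rule 1: only the newest expert learns
        else:
            status[name] = "not yet added"  # later experts do not exist yet
    return status

stage_three = expert_status(2)
```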
Why this order?
The sequence is not arbitrary — each arrow is a dependency:
- Understanding before generation. You cannot imagine a coherent future until you can perceive the present. Stage 1 builds the perceptual prior that everything conditions on.
- Image before video. A video is images over time. The video expert reuses the frozen image expert's spatial priors and only has to learn temporal dynamics — a much smaller lift than learning pixels and motion together.
- Generation before action. Action is conditioned on imagination (Ch. 2). The model must be able to roll out a predicted video/proprioception trajectory before it can choose a motor command that achieves it. So the action expert comes last, reading from a frozen world model.
Why freeze?
Freezing earlier components does three things at once:
- No catastrophic forgetting. Hard-won perception and image priors can't be eroded by the noisier gradients of a new, harder generative task. The capability you already paid for is locked in. This matters most precisely because of the data pyramid: if everything trained jointly, the tiny action dataset would be swamped by — and would perturb — the enormous understanding corpus. Freezing lets the data-poor expert learn from a strong backbone without putting the data-rich capabilities at risk.
- Clean credit assignment. The newest expert is the only thing learning, so every gradient is attributable to it. Nothing else can drift to "explain away" its errors.
- Cheaper training. A frozen expert is close to a pure forward pass: its weights need no optimizer state and no gradient computation or storage. Each later stage is far cheaper than training the whole stack jointly.
Because every weight, attention projections included, lives inside a modality expert, freezing an expert freezes its Q/K/V too. The global attention operation has no parameters of its own; it simply lets each newly added expert attend to the keys and values that the frozen experts produce.
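The cost argument can be made concrete with a little arithmetic. The sketch below assumes an Adam-style optimizer, which keeps two moment estimates per trainable parameter; the source does not name an optimizer, and the expert sizes are hypothetical numbers chosen only to make the ratio visible.

```python
def adam_state_values(param_counts, trainable):
    """Count optimizer-state values: two moments per trainable parameter."""
    return sum(2 * count for name, count in param_counts.items()
               if name in trainable)

# Hypothetical parameter counts per expert, not taken from any real model.
PARAMS = {"understanding": 8_000, "image": 2_000, "video_audio": 2_000, "action": 500}

joint_state = adam_state_values(PARAMS, trainable=set(PARAMS))  # train everything
staged_state = adam_state_values(PARAMS, trainable={"action"})  # last stage only
```

With these made-up sizes, the final stage carries optimizer state for 500 parameters instead of 12,500, and the same reasoning applies to weight gradients.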
The learning-rate schedule
All of this rides on a single learning-rate schedule with three phases: a short warm-up that ramps the LR to its peak, a long constant-LR (CLR) plateau that does the bulk of the work, and a ramp-down that anneals the LR toward zero. The staged curriculum maps directly onto it. The first 50% of the CLR plateau is understanding only — the data-rich foundation — and the generation stages (image, then video/audio, then action) are layered into the second half. The ramp-down is reserved for high-quality data across every modality, polishing the whole stack as the LR decays to zero.
Scrub through a full pretraining run below — or hit play. The shaded bands show which data is mixed in; the readout tracks the LR, the data, and which expert is learning at that point.
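The schedule and its mapping to data can be written as two small functions. This is a sketch under stated assumptions: the text fixes only the three phases and the 50% split of the plateau, so the linear ramps, the equal three-way split of the plateau's second half, and every parameter name here are choices made for illustration.

```python
def lr_at(step, total_steps, warmup_steps, decay_steps, peak_lr):
    """Warm-up, constant plateau, ramp-down to zero (linear ramps assumed)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps

def data_at(step, total_steps, warmup_steps, decay_steps):
    """Which data the curriculum is adding at this step."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return "warm-up (data mix not specified in the text)"
    if step >= decay_start:
        return "high-quality data, all modalities"
    # Fraction of the constant-LR plateau completed so far.
    frac = (step - warmup_steps) / (decay_start - warmup_steps)
    if frac < 0.5:
        return "understanding only"
    # Assumption: the three generation stages share the second half equally.
    generation = ["image generation", "video/audio generation", "action"]
    return generation[min(int((frac - 0.5) * 6), 2)]

midpoint_lr = lr_at(500, total_steps=1000, warmup_steps=20, decay_steps=200, peak_lr=3e-4)
```

For a hypothetical 1,000-step run with 20 warm-up steps and 200 decay steps, `data_at(100, 1000, 20, 200)` falls in the understanding-only half of the plateau, and `data_at(750, 1000, 20, 200)` falls in the action stage.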
Ch. 05 Post-training & the physics flywheel
Pretraining learns the prior over how the multimodal world works. Two stages turn that prior into a capable agent:
- Post-training (SFT). Supervised fine-tuning on curated Perceive → Imagine → Act demonstrations, with explicit multimodal reasoning traces — the imagination made legible.
- Reinforcement learning (RL). Self-play in simulation, scored by automatic reward signals. This is where the model becomes self-improving.
In simulation, the model runs its full loop and grades itself automatically on three questions:
- Did what I imagined actually happen? — imagination accuracy.
- Did I succeed? — task completion.
- Does my prediction violate physics? — consistency, checked against engine constraints. Call it RL from physics feedback (RLPF).
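The three questions can be folded into a single scalar reward. The sketch below is one possible shape, not a specification: the scoring formulas, the equal default weights, and the idea of counting violations reported by the physics engine are all assumptions made for illustration.

```python
def imagination_accuracy(imagined, observed):
    """Score in (0, 1]: 1.0 when the imagined trajectory matches what happened."""
    mse = sum((a - b) ** 2 for a, b in zip(imagined, observed)) / len(observed)
    return 1.0 / (1.0 + mse)

def physics_consistency(violations):
    """Score in (0, 1]: 1.0 when the engine reports no constraint violations."""
    return 1.0 / (1.0 + violations)

def rollout_reward(imagined, observed, task_succeeded, violations,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three automatic signals for one simulated rollout."""
    w_imagine, w_task, w_physics = weights
    return (w_imagine * imagination_accuracy(imagined, observed)
            + w_task * float(task_succeeded)
            + w_physics * physics_consistency(violations))

perfect = rollout_reward([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], True, 0)
```

Keeping the three terms separate until the final sum also lets each one be logged on its own, which helps tell a planning failure apart from a world-model failure.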
Trace one lap to see why it compounds. A better world model makes the model's imagination more accurate, so it can plan against a mental simulation it trusts — better plans. Better plans yield better actions, which succeed more often. Those successful rollouts ("this plan led to this outcome") are exactly the labeled examples you want, so they become richer training data — which trains an even better world model. Each lap raises the floor for the next.
Text models can self-improve this way too — but only where an automatic grader exists. Math has checkable answers, code has unit tests, and RL from verifiable rewards has driven much of the recent progress in exactly those domains. Two things limit it. First, breadth: verifiable graders cover a narrow slice of what we care about, and everything else falls back on learned proxies — reward models, LLM judges — which are themselves models that can be reward-hacked and carry no ground truth about the physical world. Second, what the signal improves: a unit test sharpens a policy, but it doesn't teach the model how the world works. Physics fixes both. It is a broad, grounded grader across the entire embodied space — did the glass actually lift? did the motion obey gravity? — and that signal flows back into the world model itself. Physics is to physical reality what unit tests are to code.
The catch is that the whole loop runs inside a simulation, so its ceiling is simulation fidelity. Train against a simulator that gets physics wrong and the model learns from a lie: the rewards are noise and nothing compounds (exactly what the fidelity slider below shows). And neither obvious option is enough on its own: a hand-built physics engine is accurate but can't render the visual richness of the real world, while real video is realistic but you can't act inside it. The essential ingredient is therefore a learned world simulator: trained on real video for realism, built to respond to actions so the agent can act inside it, and corrected by physics engines for physical consistency.
Drag the fidelity slider to see how it gates the compounding rate.
Low fidelity and the flywheel slips: the model learns from a simulation that lies about physics, so capability barely compounds. High fidelity and each cycle multiplies the last. The whole strategy lives or dies on the quality of the imagined world.
Ch. 06 One model, many problems
Here is the payoff of building it as one model. Every major AI capability challenge is a different route through the same architecture — a particular choice of input modalities and output modality. Pick inputs and an output below to see which of today's siloed research problems you've just described. Familiar combinations name an established task; the rest are tagged potential application — routes the same model could serve, with no household name yet.
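The routing idea amounts to a lookup keyed by a set of input modalities and one output modality. The sketch below is illustrative only: the handful of entries are widely used task names that I am supplying, not the table behind this page, and `describe_route` is a hypothetical helper.

```python
# A few well-known routes; the real table would be much larger.
ESTABLISHED_TASKS = {
    (frozenset({"image"}), "text"): "image captioning",
    (frozenset({"image", "text"}), "text"): "visual question answering",
    (frozenset({"speech"}), "text"): "speech recognition",
    (frozenset({"text"}), "speech"): "text-to-speech",
    (frozenset({"text"}), "image"): "text-to-image generation",
    (frozenset({"text"}), "video"): "text-to-video generation",
}

def describe_route(inputs, output):
    """Name the task for a route, or tag it as a potential application."""
    key = (frozenset(inputs), output)
    return ESTABLISHED_TASKS.get(key, "potential application")

example = describe_route(["text", "image"], "text")
```

Using a `frozenset` for the inputs makes the lookup independent of the order in which modalities are picked.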
Today each of these is a separate model, dataset, and research community. The Perceive → Imagine → Act framework unifies them as different routes through one architecture — and improvements to the shared backbone benefit every task simultaneously. That is the whole bet: build the multimodal mind once, and every capability comes along for the ride.