Hierarchical inference.
π₀.5 plans by talking to itself. Before each second of motion, the same VLM decodes a short subtask string in plain English — and conditions the next chunk on what it just said.
An action chunk is one second of motion. "Clean the kitchen" is two minutes of motion at minimum. Something has to bridge the gap between a goal stated in language and a one-second window of joint angles.
The bridge is to ask the VLM what to do next, in language. At every chunk boundary — about once a second — the VLM autoregressively decodes a short string: "pick up the sponge". That string then becomes part of the conditioning for the next chunk. The action expert flow-matches a one-second trajectory under that string, the robot executes it, and at the next boundary the VLM emits another string: "place on counter". Two clocks running side by side.
The asymmetry is the whole point. The high-level loop runs at about one hertz — slow, deliberative, generating tokens. The low-level loop runs at chunk-rate — fast, all fifty actions decoded in parallel by ten Euler steps of flow. The robot's controller runs at fifty hertz, executing actions out of the most-recent chunk one at a time. Each layer is doing the kind of work it is good at, at the rate that work demands.
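In pseudocode, the two clocks look something like this. The sketch is illustrative: the object and method names (vlm.decode_subtask, action_expert.sample_chunk, robot.send, and so on) are placeholders rather than π₀.5's actual interfaces; only the rates, the chunk length, and the Euler step count are taken from the description above.

```python
CHUNK_LEN = 50      # actions per chunk: about one second of motion
CONTROL_HZ = 50     # rate at which the controller consumes actions
FLOW_STEPS = 10     # Euler steps used to decode one chunk

def run_episode(vlm, action_expert, robot, task_prompt, max_chunks=120):
    prefix = task_prompt
    for _ in range(max_chunks):                      # ~1 Hz: one pass per chunk boundary
        obs = robot.observe()                        # camera frames + proprioception
        subtask = vlm.decode_subtask(obs, prefix)    # e.g. "pick up the sponge"
        prefix = task_prompt + " " + subtask         # the string joins the language prefix
        chunk = action_expert.sample_chunk(          # all fifty actions decoded together
            obs, prefix, num_euler_steps=FLOW_STEPS, horizon=CHUNK_LEN)
        for action in chunk:                         # 50 Hz: consume one action at a time
            robot.send(action)
            robot.wait(1.0 / CONTROL_HZ)
```

The outer loop is the one-hertz clock, the inner loop is the fifty-hertz clock, and the flow sampler sits between them, running only at the boundary.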
Why language as the bottleneck? The natural alternative is to pass a learned vector — some compressed latent representation of "what to do next" — between the planning layer and the acting layer. Both architectures look similar on a whiteboard.
Language wins for two reasons that are hard to engineer in any other way. First, the VLM's language head already knows compositional structure: it has seen "the red mug" and "from the counter" and "behind the toaster" combined a billion ways during text pretraining. A new subtask string the model has never literally produced before — say, "pick up the blue spatula behind the kettle" — still parses into familiar pieces. Compositional generalization is inherited, not learned. A learned vector latent gets none of that for free; you would have to train it explicitly, on every combination, to get the same behavior.
Second, language is interpretable for debugging. When the robot does something wrong, you can read what it thought it was doing. If a kitchen-cleaning rollout produces the string "wipe the counter" while the robot is hovering over a stack of plates, you have a precise locus for the failure: the high-level planner picked the wrong subtask, not the low-level flow that executed it. With a vector latent, the same failure looks like noise.
The cost is fragility to phrasing. The π₀.5 paper documents swings from zero to one hundred percent on the same physical task when the high-level prompt is reworded (π₀.5, §4). Independent evaluation by Penn GRASP reports zero percent success on novel glass and unseen backgrounds, with prompt-rewording sensitivity as a frequent failure mode. A model that listens to language is a model that can be confused by the words it is listening to. The same property that buys compositional generalization also buys this brittleness.
All of that is still runtime. The hard part is what kind of training recipe produces a VLM that can both (a) decode a useful subtask string from a kitchen photo and (b) hand its features to an action expert without those features being ruined by the action loss. That is the next page.
Where exactly does the subtask string enter the action expert's conditioning?
The subtask string is decoded by the VLM at the chunk boundary and then concatenated into the same prefix of language tokens that the action expert reads via shared self-attention. From the action expert's point of view there is no separate "subtask channel" — the words are simply additional language tokens in the prefix, indistinguishable in mechanism from the original task prompt. The model has been trained to treat the most-recently-emitted subtask string as the operative one.
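A minimal sketch of what "no separate subtask channel" means mechanically, assuming a tokenizer and a blockwise attention mask in the spirit of the architecture described here; the function names and the exact mask layout are assumptions, not the published implementation.

```python
import numpy as np

def build_prefix_tokens(tokenizer, task_prompt, subtask_string):
    # The freshly decoded subtask is just more language tokens appended to the
    # same prefix the action expert already reads; nothing marks it as special.
    return tokenizer.encode(task_prompt) + tokenizer.encode(subtask_string)

def shared_attention_mask(num_prefix_tokens, num_action_tokens):
    # Illustrative blockwise mask: action tokens attend to the full prefix
    # (images, task prompt, subtask string) and to each other, while prefix
    # tokens attend only within the prefix.
    n = num_prefix_tokens + num_action_tokens
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_prefix_tokens, :num_prefix_tokens] = True  # prefix attends within the prefix
    mask[num_prefix_tokens:, :] = True                    # action tokens attend to everything
    return mask
```

The only thing that changes at a chunk boundary is the content of the prefix; the attention pattern is the same whether the words came from the human's prompt or from the model's own high-level decode.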
How is the high-level head trained to emit useful subtasks?
The π₀.5 data mix includes roughly eleven percent high-level annotations, in which a human watched a long-horizon rollout and labeled which subtask was happening at each chunk boundary. The same VLM is trained to predict those labels by ordinary next-token language modeling on the labels themselves. That supervision is what gives the language head its prior over robot-relevant subtasks. The paper's own ablations show this verbal-instruction stream is one of the two active levers behind π₀.5's generalization, alongside embodiment diversity in the action data; removing web VQA had no statistically significant effect on the main tasks.
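Schematically, the co-training objective is an ordinary next-token loss on the human subtask labels plus a flow-matching regression on the action chunks, applied to whichever supervision a given batch carries. The sketch below assumes generic model heads and a standard linear-interpolation flow path; the names, the equal weighting, and the exact flow parameterization are placeholders rather than the paper's.

```python
import torch
import torch.nn.functional as F

def cotraining_loss(model, batch):
    """One co-trained step: cross-entropy on subtask labels where they exist
    (the roughly-eleven-percent high-level annotations), flow matching where
    action chunks exist."""
    loss = 0.0
    if "subtask_labels" in batch:
        logits = model.language_logits(batch["images"], batch["prompt_tokens"])
        loss = loss + F.cross_entropy(
            logits.flatten(0, 1), batch["subtask_labels"].flatten())
    if "actions" in batch:
        a = batch["actions"]                 # (B, 50, action_dim) clean chunk
        noise = torch.randn_like(a)
        t = torch.rand(a.shape[0], 1, 1)     # one flow time per example
        x_t = (1 - t) * noise + t * a        # linear interpolation path
        target_v = a - noise                 # velocity the expert should predict
        pred_v = model.action_velocity(
            batch["images"], batch["prompt_tokens"], x_t, t)
        loss = loss + F.mse_loss(pred_v, target_v)
    return loss
```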
What happens at a chunk boundary in detail?
Sequence: (a) the controller has just consumed the last action of the previous chunk; (b) fresh camera frames and proprioception are fed to the VLM, which autoregressively decodes a short subtask string — about a hundred milliseconds for a half-dozen tokens on a modern GPU; (c) that string is appended to the language prefix; (d) the action expert samples a fresh noise chunk and runs ten Euler steps of flow matching, conditioned on the new prefix; (e) the controller starts consuming the new chunk at fifty hertz. The high-level decode and the low-level flow happen in series at each boundary; between boundaries, only the fifty-hertz controller runs, consuming the chunk one action at a time.
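Step (d) in isolation is a ten-step Euler integration of the learned velocity field, starting from pure noise and conditioned on the updated prefix. The sketch below assumes a zero-to-one time convention and an action_expert.velocity call standing in for the real action expert; both are assumptions, not the paper's exact parameterization.

```python
import torch

def sample_chunk(action_expert, obs_features, prefix_tokens,
                 horizon=50, action_dim=14, steps=10):  # action_dim is a placeholder
    """Decode one action chunk by Euler integration of the flow."""
    x = torch.randn(1, horizon, action_dim)   # fresh noise chunk
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1, 1), i * dt)
        v = action_expert.velocity(obs_features, prefix_tokens, x, t)  # velocity at time t
        x = x + dt * v                         # one Euler step toward the data
    return x                                   # (1, 50, action_dim) one-second trajectory
```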