Flow matching.
How π₀.5 turns a fistful of random noise into a coherent second of robot motion in ten small steps.
The action expert has to produce fifty future actions, each eighteen-dimensional — nine hundred numbers describing the next second of behavior. The obvious thing to do is regress them. But that fails in a particular way: if two valid action chunks exist for the same scene — the robot could grab the mug from the left or from the right — mean-squared regression averages them, and the model produces an action chunk that points between the two grasps. That is a crash, not a behavior. Action distributions are multi-modal; their averages are not actions. Autoregressing nine hundred numbers one at a time is no better — at thirty milliseconds per token that is half a minute per chunk.
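The averaging failure is easy to demonstrate in one dimension. A hypothetical sketch: two valid "grasps" at +1 and −1, and the MSE-optimal constant prediction, which lands between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the two-grasp problem: for the same scene, the
# demonstrations contain two valid one-dimensional "actions",
# +1 (grasp from the right) and -1 (grasp from the left).
targets = np.concatenate([
    rng.normal(+1.0, 0.05, size=500),
    rng.normal(-1.0, 0.05, size=500),
])

# The MSE-optimal constant prediction is the mean of the targets,
# which sits between the modes: an action no demonstrator produced.
prediction = targets.mean()
distance_to_nearest_mode = min(abs(prediction - 1.0), abs(prediction + 1.0))
```

The prediction sits near zero, far from both modes — the one-dimensional version of pointing between the two grasps.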
The trick is to learn which direction to move, not the chunk itself. Imagine the chunk as a single point in a nine-hundred-dimensional space. Now imagine a much simpler point: pure Gaussian noise, also a point in that space. The action expert learns a velocity field — a function that, at every point along a path between noise and a clean chunk, says "go this way." At runtime you start from a fresh noise sample and follow the field for a few small steps. A coherent chunk falls out the other end.
That is the whole picture, in two dimensions. Each particle starts in the noise blob at t = 0. At every point in space — and every value of t — there is an arrow that says where to go next. Following the arrows for a finite time delivers the particle into the ring. At training time the model is taught what those arrows are; at inference time it follows them.
The lift to robot actions is mechanical. Replace "point in 2D" with "fifty-by-eighteen action chunk," "noise blob" with a 900-dimensional Gaussian, and "ring" with the distribution of clean chunks the robot has seen humans demonstrate. Same machinery, much higher dimension.
Training the field. Pick a clean chunk x₁ from a human demonstration. Sample fresh Gaussian noise ε. Pick a random time-stamp t in [0, 1]. Build the in-between point on the straight line:

xt = t · x₁ + (1 − t) · ε

and ask the network to predict the velocity along that line — which is just the constant x₁ − ε. The loss is mean-squared error against that target:

L(θ) = E ‖vθ(xt, t, c) − (x₁ − ε)‖²

where c is the conditioning context — images and language read from the VLM through shared self-attention. That is the entire training-time objective for the action expert.
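The construction of one training example can be sketched in a few lines of NumPy. The conditioning c is omitted, and the chunk shape (50 × 18) follows the dimensions above; everything else is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 50, 18  # chunk horizon x action dimension

def training_target(x1, rng):
    """One flow-matching training example for a clean chunk x1.

    Returns the noisy input x_t, the time-stamp t, and the velocity
    target x1 - eps. Conditioning c (images, language) is omitted.
    """
    eps = rng.normal(size=x1.shape)   # fresh Gaussian noise
    t = rng.uniform()                 # time-stamp (uniform here for simplicity)
    x_t = t * x1 + (1.0 - t) * eps    # point on the straight line
    v_target = x1 - eps               # constant velocity along that line
    return x_t, t, v_target

x1 = rng.normal(size=(H, D))          # stand-in for a demonstrated chunk
x_t, t, v_target = training_target(x1, rng)

def mse(v_pred, v_target):
    """The training loss: mean-squared error against the target velocity."""
    return np.mean((v_pred - v_target) ** 2)
```

A useful sanity check on the linear path: x_t plus the remaining time (1 − t) times the target velocity lands exactly on x₁.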
Inference: integrate the ODE. Start at a fresh noise sample x₀ ∼ N(0, I). Take ten forward-Euler steps of size δ = 0.1. At each step, evaluate the network and advance the chunk:

xt+δ = xt + δ · vθ(xt, t, c)

After ten steps you arrive at t = 1 and the chunk is clean. On an RTX 4090, all ten steps cost about twenty-seven milliseconds; the remainder of the seventy-three-millisecond budget goes to image encoding and the prefix forward pass.¹

¹ π₀.5 §D, Table I: image encoders 14 ms, prefix forward pass 32 ms, ten action-token forward passes 27 ms — 73 ms on-board.
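The integration loop itself is tiny. This sketch substitutes an oracle velocity field for the trained network: for linear paths toward a fixed clean chunk x₁, the true field is (x₁ − xt)/(1 − t), so ten Euler steps should land exactly on x₁.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 50, 18
STEPS, DELTA = 10, 0.1

X1 = rng.normal(size=(H, D))  # pretend this is the clean chunk

def v_theta(x_t, t):
    """Stand-in for the action expert's velocity prediction.

    Oracle field for the linear path toward X1: the constant X1 - eps,
    which with x_t = t*X1 + (1-t)*eps equals (X1 - x_t) / (1 - t).
    """
    return (X1 - x_t) / (1.0 - t)

x = rng.normal(size=(H, D))   # x_0 ~ N(0, I)
t = 0.0
for _ in range(STEPS):
    x = x + DELTA * v_theta(x, t)  # forward Euler step
    t += DELTA
```

With the oracle field the path is exactly straight, so Euler makes no discretization error; the real network's field is only nearly straight, which is why ten small steps are needed rather than one big one.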
Why ten steps? Because the chunk-quality curve has a knee there. One step is too few — the network has to map noise to a clean chunk in a single jump, and at high curvature it can't. Five steps gets close. Ten is the plateau. Twenty doubles the cost and adds nothing.
Connection back to π₀.5. The action expert's job, at every step of integration, is exactly to compute vθ(xt, t, c). Its inputs are the current noisy chunk, the time-stamp t, and the context c from the vision-language model. The time-stamp is mixed into each action token's input projection via a sinusoidal embedding. Everything else in the architecture exists to produce a good velocity prediction in twenty-seven milliseconds.
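A sinusoidal embedding of the time-stamp can be sketched as below. The transformer-style frequency layout and the 32-dimensional width are assumptions for illustration; the paper specifies only that a sinusoidal embedding of t is mixed into each action token's input projection.

```python
import numpy as np

def sinusoidal_embedding(t, dim=32, max_period=10_000.0):
    """Map a scalar time-stamp t in [0, 1] to a dim-vector of sines and cosines.

    Standard transformer-style layout: geometrically spaced frequencies,
    half the dimensions sine, half cosine. Width and frequencies here are
    illustrative, not pi-0.5's exact values.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# This vector would then be combined with each action token's input
# projection (the mixing weights themselves are omitted here).
emb = sinusoidal_embedding(0.3)
```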
One second of motion is, however, too short for a task like "clean the kitchen." The model needs a way to plan over longer horizons.
How the time-stamp is sampled during training (it is not uniform).
π₀.5 inherits π₀'s Beta(1.5, 1)-shaped sampler over τ ∈ [0, 0.999], which weights high-noise time-stamps more heavily than uniform.² The schedule is the opposite of Stable Diffusion 3's, which emphasizes the middle of the noise range. The reasoning: at low noise the answer is essentially given by the conditioning, so there is little to learn there; at high noise the chunk is genuinely uncertain and the gradient is informative. The cap at 0.999 exists because inference uses ten steps of size 0.1 — the model never needs to evaluate time-stamps above 0.999 at runtime.

² π₀ §B, "Sampling the flow matching timestep."
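The shape of the sampler is easy to verify empirically. In this sketch τ follows the sampler's own convention, where larger τ means more noise; the density of Beta(1.5, 1) grows like √τ, so roughly 65% of the mass lands in the upper (noisier) half of the range, versus 50% for uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 0.999  # cap: the model never needs time-stamps above this at runtime

# Beta(1.5, 1)-shaped sampler over [0, S], following the description in
# pi-0 section B; in this sampler's convention, larger tau = higher noise.
tau = S * rng.beta(1.5, 1.0, size=100_000)

# Density proportional to sqrt(tau): the upper half of the range gets
# 1 - 0.5**1.5, about 65% of the mass, versus 50% under uniform sampling.
frac_high_noise = np.mean(tau > 0.5 * S)
```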
Why flow matching and not diffusion?
The two families share a generative-model lineage but train against different targets. Score-based diffusion learns a noise-prediction or score function under a multi-step noise schedule and integrates a stochastic differential equation at inference, typically over fifty to a thousand steps. Flow matching with linear (optimal-transport) paths is the special case where the path between noise and data is a straight line and the target velocity is a constant — making the ODE nearly straight and integrable in five-to-ten Euler steps. For π₀.5's inference budget, that order-of-magnitude speedup is the entire reason flow was chosen.³

³ Lipman et al. 2023 (Flow Matching for Generative Modeling); Liu et al. 2023 (Rectified Flow); Stable Diffusion 3 also adopted this family.
What happens to the prefix during the ten Euler steps?
The prefix — image patches, language tokens, and the proprioceptive state token — does not change as t walks from 0 to 1. The model computes the prefix's keys and values once and caches them. Each Euler step only re-runs the action-block forward pass over the fifty action tokens, attending to the cached prefix. That is what keeps the per-step cost at about 2.7 ms.
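The caching pattern can be sketched with a toy single-head attention layer. All sizes and weights here are stand-ins (a 64-wide model, 200 prefix tokens); the point is the structure: prefix keys and values are computed once, before the loop, and only the action-token pass repeats.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64            # toy model width (stand-in)
N_PREFIX = 200    # image patches + language + state tokens (stand-in)
N_ACTION = 50     # action tokens
STEPS, DELTA = 10, 0.1

# Toy projection weights, stand-ins for the real model's parameters.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))

prefix = rng.normal(size=(N_PREFIX, D))

# Computed ONCE: the prefix does not change as t walks from 0 to 1,
# so its keys and values are cached before the first Euler step.
prefix_k = prefix @ Wk
prefix_v = prefix @ Wv

def action_block(action_tokens):
    """One toy attention pass: action tokens attend to the cached prefix."""
    q = action_tokens @ Wq
    scores = q @ prefix_k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ prefix_v

x = rng.normal(size=(N_ACTION, D))  # noisy action tokens
for _ in range(STEPS):
    # Only this forward pass repeats per Euler step; prefix_k and
    # prefix_v are never recomputed.
    x = x + DELTA * action_block(x)
```

Per step, the work is proportional to the fifty action tokens attending over the cached prefix, not to re-encoding the prefix itself — which is what holds the per-step cost near 2.7 ms.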