Flow matching.
How π₀.5 turns a fistful of random noise into a coherent second of robot motion in ten small steps.
The action expert has to produce fifty future actions, each eighteen-dimensional — nine hundred numbers describing the next second of behavior. The obvious thing to do is regress them. But that fails in a particular way: if two valid action chunks exist for the same scene — the robot could grab the mug from the left or from the right — mean-squared regression averages them, and the model produces an action chunk that points between the two grasps. That is a crash, not a behavior. Action distributions are multi-modal; their averages are not actions. Autoregressing nine hundred numbers one at a time is no better — at thirty milliseconds per token that is half a minute per chunk.
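The averaging failure is easy to demonstrate in one dimension. A hypothetical sketch: two valid "grasps" at +1 and −1, and the MSE-optimal constant prediction, which lands between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the two-grasp problem: for the same scene, the
# demonstrations contain two valid one-dimensional "actions",
# +1 (grasp from the right) and -1 (grasp from the left).
targets = np.concatenate([
    rng.normal(+1.0, 0.05, size=500),
    rng.normal(-1.0, 0.05, size=500),
])

# The MSE-optimal constant prediction is the mean of the targets,
# which sits between the modes: an action no demonstrator produced.
prediction = targets.mean()
distance_to_nearest_mode = min(abs(prediction - 1.0), abs(prediction + 1.0))
```

The prediction sits near zero, far from both modes — the one-dimensional version of pointing between the two grasps.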
The trick is to learn which direction to move, not the chunk itself. Imagine the chunk as a single point in a nine-hundred-dimensional space. Now imagine a much simpler point: pure Gaussian noise, also a point in that space. The action expert learns a velocity field — a function that, at every point along a path between noise and a clean chunk, says "go this way." At runtime you start from a fresh noise sample and follow the field for a few small steps. A coherent chunk falls out the other end.
That is the whole picture, in two dimensions. Each particle starts in the noise blob at t = 0. At every point in space — and every value of t — there is an arrow that says where to go next. Following the arrows for a finite time delivers the particle into the ring. At training time the model is taught what those arrows are; at inference time it follows them.
The lift to robot actions is mechanical. Replace "point in 2D" with "fifty-by-eighteen action chunk," "noise blob" with a 900-dimensional Gaussian, and "ring" with the distribution of clean chunks the robot has seen humans demonstrate. Same machinery, much higher dimension.
Training the field. Pick a clean chunk x₁ from a human demonstration. Sample fresh Gaussian noise ε. Pick a random time-stamp t in [0, 1]. Build the in-between point on the straight line:

xt = t · x₁ + (1 − t) · ε

and ask the network to predict the velocity along that line — which is just the constant x₁ − ε. The loss is mean-squared error against that target:

L(θ) = E ‖vθ(xt, t, c) − (x₁ − ε)‖²

where c is the conditioning context — images and language read from the VLM through shared self-attention. That is the entire training-time objective for the action expert.
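The construction of one training example can be sketched in a few lines of NumPy. The conditioning c is omitted, and the chunk shape (50 × 18) follows the dimensions above; everything else is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 50, 18  # chunk horizon x action dimension

def training_target(x1, rng):
    """One flow-matching training example for a clean chunk x1.

    Returns the noisy input x_t, the time-stamp t, and the velocity
    target x1 - eps. Conditioning c (images, language) is omitted.
    """
    eps = rng.normal(size=x1.shape)   # fresh Gaussian noise
    t = rng.uniform()                 # time-stamp (uniform here for simplicity)
    x_t = t * x1 + (1.0 - t) * eps    # point on the straight line
    v_target = x1 - eps               # constant velocity along that line
    return x_t, t, v_target

x1 = rng.normal(size=(H, D))          # stand-in for a demonstrated chunk
x_t, t, v_target = training_target(x1, rng)

def mse(v_pred, v_target):
    """The training loss: mean-squared error against the target velocity."""
    return np.mean((v_pred - v_target) ** 2)
```

A useful sanity check on the linear path: x_t plus the remaining time (1 − t) times the target velocity lands exactly on x₁.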
Inference: integrate the ODE. Start at a fresh noise sample x₀ ∼ N(0, I). Take ten forward-Euler steps of size δ = 0.1. At each step, evaluate the network and advance the chunk:

xt+δ = xt + δ · vθ(xt, t, c)

After ten steps you arrive at t = 1 and the chunk is clean. On an RTX 4090, all ten steps cost about twenty-seven milliseconds; the remainder of the seventy-three-millisecond budget goes to image encoding and the prefix forward pass.¹

¹ π₀.5 §D, Table I: image encoders 14 ms, prefix forward pass 32 ms, ten action-token forward passes 27 ms — 73 ms on-board.
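The integration loop itself is tiny. This sketch substitutes an oracle velocity field for the trained network: for linear paths toward a fixed clean chunk x₁, the true field is (x₁ − xt)/(1 − t), so ten Euler steps should land exactly on x₁.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 50, 18
STEPS, DELTA = 10, 0.1

X1 = rng.normal(size=(H, D))  # pretend this is the clean chunk

def v_theta(x_t, t):
    """Stand-in for the action expert's velocity prediction.

    Oracle field for the linear path toward X1: the constant X1 - eps,
    which with x_t = t*X1 + (1-t)*eps equals (X1 - x_t) / (1 - t).
    """
    return (X1 - x_t) / (1.0 - t)

x = rng.normal(size=(H, D))   # x_0 ~ N(0, I)
t = 0.0
for _ in range(STEPS):
    x = x + DELTA * v_theta(x, t)  # forward Euler step
    t += DELTA
```

With the oracle field the path is exactly straight, so Euler makes no discretization error; the real network's field is only nearly straight, which is why ten small steps are needed rather than one big one.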
Why ten steps? Because the chunk-quality curve has a knee there. One step is too few — the network has to map noise to a clean chunk in a single jump, and at high curvature it can't. Five steps gets close. Ten is the plateau. Twenty doubles the cost and adds nothing.
Connection back to π₀.5. The action expert's job, at every step of integration, is exactly to compute vθ(xt, t, c). Its inputs are the current noisy chunk, the time-stamp t, and the context c from the vision-language model. The time-stamp is mixed into each action token's input projection via a sinusoidal embedding. Everything else in the architecture exists to produce a good velocity prediction in twenty-seven milliseconds.
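A sinusoidal embedding of the time-stamp can be sketched as below. The transformer-style frequency layout and the 32-dimensional width are assumptions for illustration; the paper specifies only that a sinusoidal embedding of t is mixed into each action token's input projection.

```python
import numpy as np

def sinusoidal_embedding(t, dim=32, max_period=10_000.0):
    """Map a scalar time-stamp t in [0, 1] to a dim-vector of sines and cosines.

    Standard transformer-style layout: geometrically spaced frequencies,
    half the dimensions sine, half cosine. Width and frequencies here are
    illustrative, not pi-0.5's exact values.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# This vector would then be combined with each action token's input
# projection (the mixing weights themselves are omitted here).
emb = sinusoidal_embedding(0.3)
```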
One second of motion is, however, too short for a task like "clean the kitchen." The model needs a way to plan over longer horizons.
How the time-stamp is sampled during training (it is not uniform).
π₀.5 inherits π₀'s Beta(1.5, 1)-shaped sampler over τ ∈ [0, 0.999], which weights high-noise time-stamps more heavily than uniform.² The schedule is the opposite of Stable Diffusion 3's, which emphasizes the middle of the noise range. The reasoning: at low noise the answer is essentially given by the conditioning, so there is little to learn there; at high noise the chunk is genuinely uncertain and the gradient is informative. The cap at 0.999 exists because inference uses ten steps of size 0.1 — the model never needs to evaluate time-stamps above 0.999 at runtime.

² π₀ §B, "Sampling the flow matching timestep."
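The shape of the sampler is easy to verify empirically. In this sketch τ follows the sampler's own convention, where larger τ means more noise; the density of Beta(1.5, 1) grows like √τ, so roughly 65% of the mass lands in the upper (noisier) half of the range, versus 50% for uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 0.999  # cap: the model never needs time-stamps above this at runtime

# Beta(1.5, 1)-shaped sampler over [0, S], following the description in
# pi-0 section B; in this sampler's convention, larger tau = higher noise.
tau = S * rng.beta(1.5, 1.0, size=100_000)

# Density proportional to sqrt(tau): the upper half of the range gets
# 1 - 0.5**1.5, about 65% of the mass, versus 50% under uniform sampling.
frac_high_noise = np.mean(tau > 0.5 * S)
```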
Why flow matching and not diffusion?
The two families share a generative-model lineage but train against different targets. Score-based diffusion learns a noise-prediction or score function under a multi-step noise schedule and integrates a stochastic differential equation at inference, typically over fifty to a thousand steps. Flow matching with linear (optimal-transport) paths is the special case where the path between noise and data is a straight line and the target velocity is a constant — making the ODE nearly straight and integrable in five-to-ten Euler steps. For π₀.5's inference budget, that order-of-magnitude speedup is the entire reason flow was chosen.³

³ Lipman et al. 2023 (Flow Matching for Generative Modeling); Liu et al. 2023 (Rectified Flow); Stable Diffusion 3 also adopted this family.
What happens to the prefix during the ten Euler steps?
The prefix — image patches, language tokens, and the proprioceptive state token — does not change as t walks from 0 to 1. The model computes the prefix's keys and values once and caches them. Each Euler step only re-runs the action-block forward pass over the fifty action tokens, attending to the cached prefix. That is what keeps the per-step cost at about 2.7 ms.
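The caching pattern can be sketched with a toy single-head attention layer. All sizes and weights here are stand-ins (a 64-wide model, 200 prefix tokens); the point is the structure: prefix keys and values are computed once, before the loop, and only the action-token pass repeats.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64            # toy model width (stand-in)
N_PREFIX = 200    # image patches + language + state tokens (stand-in)
N_ACTION = 50     # action tokens
STEPS, DELTA = 10, 0.1

# Toy projection weights, stand-ins for the real model's parameters.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))

prefix = rng.normal(size=(N_PREFIX, D))

# Computed ONCE: the prefix does not change as t walks from 0 to 1,
# so its keys and values are cached before the first Euler step.
prefix_k = prefix @ Wk
prefix_v = prefix @ Wv

def action_block(action_tokens):
    """One toy attention pass: action tokens attend to the cached prefix."""
    q = action_tokens @ Wq
    scores = q @ prefix_k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ prefix_v

x = rng.normal(size=(N_ACTION, D))  # noisy action tokens
for _ in range(STEPS):
    # Only this forward pass repeats per Euler step; prefix_k and
    # prefix_v are never recomputed.
    x = x + DELTA * action_block(x)
```

Per step, the work is proportional to the fifty action tokens attending over the cached prefix, not to re-encoding the prefix itself — which is what holds the per-step cost near 2.7 ms.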