Section 2 · the action expert

The other transformer.

A robot does not output words. It outputs joint angles, gripper positions, base velocities — and lots of them, fast. π₀.5's solution is a second, smaller transformer that shares attention with the language model but runs its own weights.

A language model emits one of about 256,000 possible word-pieces at each step. A robot has no such vocabulary. To drive a two-armed mobile manipulator through one second of motion at 50 Hz, the model has to produce 50 timesteps, each holding 18 floating-point numbers: seven joint targets per arm, two grippers, plus optional torso and base. ¹

¹ D = 18 is the largest configuration in the data; smaller robots get zero-padded so the same model can drive any of them.

That makes the output of one chunk a vector in $\mathbb{R}^{50 \times 18} = \mathbb{R}^{900}$, with the constraint that consecutive timesteps form a smooth motion. Discrete word-piece prediction is the wrong tool; we need a head that produces real-valued, temporally coherent trajectories.
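
A minimal NumPy sketch makes the shapes concrete. The dimensions come from this section; the zero-padding scheme follows footnote ¹, and the helper name is ours:

```python
import numpy as np

HORIZON, D_MAX = 50, 18   # 50 steps at 50 Hz; D = 18 per footnote 1

def pad_chunk(chunk: np.ndarray) -> np.ndarray:
    """Zero-pad a smaller robot's action dimension up to the shared D = 18."""
    steps, d = chunk.shape
    assert steps == HORIZON and d <= D_MAX
    return np.pad(chunk, ((0, 0), (0, D_MAX - d)))

# e.g. a single seven-joint arm plus gripper (d = 8) rides in the same model:
chunk = pad_chunk(np.zeros((HORIZON, 8), dtype=np.float32))
assert chunk.shape == (50, 18)   # one chunk is a point in R^900
assert chunk.size == 900
```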

Predicting one timestep at a time would be cheaper. It would also be worse. A single 1/50-second action carries almost no commitment about where the gripper is going; the model would have to re-decide on every tick and the resulting motion would jitter. Predicting a full second at once forces the model to commit to a coherent trajectory and lets a downstream controller interpolate between chunks smoothly.
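
The controller side is out of scope for this page, but a hypothetical linear cross-fade shows the kind of between-chunk smoothing that committing to a full second makes possible; this is an illustration, not the paper's mechanism:

```python
import numpy as np

def blend_chunks(prev_tail: np.ndarray, next_head: np.ndarray) -> np.ndarray:
    """Cross-fade the overlapping timesteps of two consecutive chunks.

    prev_tail, next_head: (K, D) action slices covering the same K steps.
    A hypothetical controller-side smoother, not a detail from the paper.
    """
    k = prev_tail.shape[0]
    w = np.linspace(0.0, 1.0, k)[:, None]   # weight ramps from old to new chunk
    return (1.0 - w) * prev_tail + w * next_head

smooth = blend_chunks(np.zeros((10, 18)), np.ones((10, 18)))
assert smooth.shape == (10, 18) and smooth[0, 0] == 0.0 and smooth[-1, 0] == 1.0
```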

Figure 2.1 — One second of motion is one chunk
An action chunk plotted as seven joint traces over fifty 50-Hz steps; sweep the slider to see the chunk reveal itself, with a small arm graphic tracking the current joint targets.

The trick: same attention, two parallel weight banks.

You could solve the problem by bolting a second transformer next to the language model and connecting the two with cross-attention layers — expensive and architecturally fussy. You could bolt a small MLP head on top and ask it to produce all 900 numbers in one shot — cheap, but throws away the transformer's ability to let action timesteps reason about each other.

π₀.5 takes a third path. The model is one transformer. Its attention layer is shared. But it carries two parallel sets of feed-forward weights: one trained on text and image tokens (the language model side, marked in purple), one trained on action tokens (the action expert, marked in amber). A token's type — text/image or action — deterministically picks which feed-forward bank it flows through. ²

² π₀ §B calls this “a single transformer with two sets of weights, also known as experts; the weights interact only through the transformer's self-attention layers.” π₀.5 inherits the construction.

Inside the shared attention, action queries see image, language, and other action keys; image and language queries see only image and language keys (the prefix-LM mask of page 1 still applies). One Q·K·V product, two parameter sets touching it. People call this a mixture of transformer experts. Routing is by token type, not learned.
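
A toy NumPy rendering makes the routing concrete. It collapses to a single attention head, gives both banks the same width, and drops residuals and normalization, so it is a sketch of the wiring, not the real block:

```python
import numpy as np

def mot_block(x, token_type, banks, mask):
    """One mixture-of-transformers block, minus residuals and norms.

    x:          (T, d) token features (toy: both banks share width d)
    token_type: (T,) ints; 0 = text/image token, 1 = action token
    banks:      two dicts of weights, keys "Wq Wk Wv Wo W1 W2"
    mask:       (T, T) boolean attention mask (the prefix-LM mask)
    """
    T, d = x.shape
    q, k, v = (np.empty_like(x) for _ in range(3))
    for t, bank in enumerate(banks):       # routing by token type, not learned
        rows = token_type == t
        q[rows] = x[rows] @ bank["Wq"]
        k[rows] = x[rows] @ bank["Wk"]
        v[rows] = x[rows] @ bank["Wv"]
    # One shared attention: this is where action queries read VLM keys.
    scores = np.where(mask, q @ k.T / np.sqrt(d), -1e9)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    h = attn @ v
    out = np.empty_like(x)
    for t, bank in enumerate(banks):       # two parallel feed-forward banks
        rows = token_type == t
        hidden = np.maximum(h[rows] @ bank["Wo"] @ bank["W1"], 0.0)
        out[rows] = hidden @ bank["W2"]
    return out

rng = np.random.default_rng(0)
d, T = 8, 6
banks = [{w: rng.standard_normal((d, d)) * 0.1 for w in
          ("Wq", "Wk", "Wv", "Wo", "W1", "W2")} for _ in range(2)]
token_type = np.array([0, 0, 0, 0, 1, 1])  # 4 prefix tokens, 2 action tokens
# Prefix queries see only the prefix; action queries see everything.
mask = np.outer(token_type == 0, token_type == 0) | (token_type == 1)[:, None]
out = mot_block(rng.standard_normal((T, d)), token_type, banks, mask)
```

In the real model the two banks have different hidden widths; they can still meet in one attention operation because their projections map into a shared attention dimension.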

Figure 2.2 — Shared attention, two MLP banks
One transformer block. Attention is one shared box; the two MLP banks live below it as parallel weight tiles. A token's type chooses which tile it routes through, both in attention's QKV projections and in the feed-forward.

What the action expert costs.

The vision-language backbone is the heavy half: about 3.0 billion parameters, dominated by an 18-layer Gemma-2B language model and a SigLIP image tower. The action expert is the lighter half: about 0.3 billion parameters — roughly 10% the size — with the same depth (18 layers) and the same number of attention heads (so the shared attention computation aligns) but a narrower MLP and a narrower hidden width.
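
A back-of-envelope check on the ~0.3B figure, under assumed shapes (hidden width 1024, a gated feed-forward with a 4096-wide intermediate; neither number is stated on this page):

```python
# 18 layers x (4 square attention projections + a gated feed-forward).
# The widths here are assumptions, not figures from the paper.
layers, width, ff = 18, 1024, 4096
attn_params = 4 * width * width   # Wq, Wk, Wv, Wo, kept square for simplicity
ffn_params = 3 * width * ff       # up-, gate-, and down-projections
print(f"~{layers * (attn_params + ffn_params) / 1e9:.2f}B")   # -> ~0.30B
```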

The asymmetry is deliberate. At inference the language backbone runs once per chunk; the action expert runs ten times per chunk (the ten flow-matching steps of page 3). Per-step compute matters more for the expert than for the backbone, so the expert is the smaller of the two.
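
The asymmetry shows up directly in the inference loop. A sketch of the control flow, with illustrative names and toy stand-ins for the two networks (nothing here is the real API):

```python
import numpy as np

def generate_chunk(encode_prefix, expert_step, images, prompt, rng, n_steps=10):
    """The ~3.0B backbone runs once per chunk; the ~0.3B expert runs once
    per flow step, attending to the backbone's cached keys/values."""
    prefix_kv = encode_prefix(images, prompt)   # heavy pass, done once
    actions = rng.standard_normal((50, 18))     # chunk starts as noise
    for step in range(n_steps):                 # ten light expert passes
        velocity = expert_step(actions, step, prefix_kv)
        actions = actions + velocity / n_steps  # one Euler integration step
    return actions

# Toy stand-ins, just to show the control flow runs:
chunk = generate_chunk(
    encode_prefix=lambda imgs, txt: None,
    expert_step=lambda a, s, kv: -a,            # dummy velocity field
    images=None, prompt="wipe the table",
    rng=np.random.default_rng(0),
)
assert chunk.shape == (50, 18)
```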

Figure 2.3 — π₀.5 in one diagram
Image and language tokens enter the VLM block; action tokens enter the same block through the parallel weight bank; their representations meet in shared attention; the action expert's outputs feed a flow head that produces the 50×18 chunk a robot consumes.

The action expert produces representations. It does not, on its own, produce actions. Each of its output tokens is a vector that needs one more step to become a joint angle: a head that maps that vector to a velocity in action space, and an integrator that takes ten steps, querying the expert for a fresh velocity at each one, to land at a clean chunk.
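
As a minimal illustration of that head (a hypothetical linear projection; the 1024-wide expert output is the same assumption as above):

```python
import numpy as np

rng = np.random.default_rng(0)
expert_width, horizon, D = 1024, 50, 18
W_head = rng.standard_normal((expert_width, D)) * 0.02   # hypothetical linear head

expert_tokens = rng.standard_normal((horizon, expert_width))  # one token per timestep
velocity = expert_tokens @ W_head   # (50, 18): a velocity in action space,
assert velocity.shape == (50, 18)   # ready for the integrator
```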

That last step is flow matching, and it is the next page.

Why share the attention layer at all?

Two reasons. First, sharing is what lets action tokens read VLM features in the same forward pass; no separate cross-attention needed. Second, attention is most of the compute when the prefix is long (≈200 image patches plus a language prompt). Doubling the attention layer would roughly double the inference cost; doubling the much smaller MLP is cheap. The action expert is also narrower than the VLM (hidden width ~1024 vs 2048), so even the “extra” bank is a fraction of a full transformer.
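
To put rough numbers on “cheap”: a forward-FLOPs comparison of the two feed-forward banks alone, using the token counts from this page and the same assumed intermediate widths as before:

```python
# Mul-add FLOPs of one gated feed-forward pass, per layer, per bank.
# Token counts come from this page; intermediate widths are assumptions.
def ffn_flops(tokens, width, ff):
    return 2 * tokens * 3 * width * ff

vlm_bank = ffn_flops(tokens=250, width=2048, ff=16384)  # ~200 patches + prompt
act_bank = ffn_flops(tokens=50, width=1024, ff=4096)    # 50 action tokens
print(f"extra bank ~ {act_bank / vlm_bank:.1%} of the VLM bank's work")  # ~2.5%
```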