Page 2 — π₀: A VLM with a Flow-Matching Action Head
Where you are. Page 1 handed you the substrate: multi-head attention, mixture-of-transformer-experts routing, the prefix-LM mask, KV cache. This page builds π₀ out of those parts. You will see one transformer, two sets of weights, a shared self-attention layer, and a 300M action expert reading PaliGemma features through that shared layer to emit a continuous action chunk via flow matching.
π₀ is a 3.3B-parameter VLA. Most of that mass is a PaliGemma VLM (SigLIP-So400m vision encoder + Gemma-2B language model, ~3.0B params total); the rest is a 300M action expert trained from scratch. [π₀ paper, arXiv 2410.24164] The two are not separate transformers with cross-attention. They are two sets of weights inside one transformer with token-wise routing: image/text tokens flow through the PaliGemma weights, state and action tokens through the action-expert weights, and the two interact only inside the shared self-attention layer. The runtime output is a chunk of 50 continuous actions produced by 10 Euler steps of a flow-matching ODE.
Sources: π₀ paper (arXiv 2410.24164); FAST paper (arXiv 2501.09747).
P1 — Pixels become tokens (PaliGemma)
A 224×224 image becomes a sequence of patch tokens; text tokens get appended; the LM sees a flat sequence.
SigLIP-So400m takes the image and grids it into 14×14-pixel patches; at the default 224-pixel input that is 16×16 = 256 patch tokens. Each patch is projected to the LM's embedding dimension by a linear projector. The text prompt (“pick up the red cube”) tokenizes into a handful of word-piece tokens. They concatenate into one flat sequence, and the LM never knows which tokens were originally pixels.
Two practical details. First, image tokens dominate the prefix length — 256 image tokens to maybe 6–12 text tokens. Second, both groups are bidirectional with each other under PaliGemma's prefix-LM mask. A pixel patch in the bottom-left can attend to the word “cube” and vice versa. That's why you needed M1 from Page 1 in your head before now.
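To make the shapes concrete, here is a minimal sketch of the prefix assembly. The 256 patch tokens and the 2048-wide LM embedding match the numbers above; the SigLIP feature width (1152) and the 8-token prompt are illustrative assumptions.

```python
import numpy as np

# Illustrative shapes: 256 patch tokens matches PaliGemma at 224 px,
# embed_dim = 2048 matches Gemma-2B; vit_dim = 1152 is assumed here.
num_patches, vit_dim, embed_dim = 256, 1152, 2048

patch_feats = np.random.randn(num_patches, vit_dim)        # SigLIP output, one per 14x14-pixel patch
projector   = np.random.randn(vit_dim, embed_dim) * 0.02   # linear projector to the LM's width
image_tokens = patch_feats @ projector                      # (256, 2048)

text_tokens = np.random.randn(8, embed_dim)                 # ~8 word-piece embeddings for the prompt

prefix = np.concatenate([image_tokens, text_tokens], axis=0)
print(prefix.shape)  # (264, 2048): one flat sequence, no marker of which tokens were pixels
```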
Three blocks, in order: [image, text] → [state] → [action]. Bidirectional within, causal between.
Three blocks in sequence. Within each block: full bidirectional attention. Between blocks: causal — a later block attends to earlier ones, earlier blocks cannot attend to later ones. The VLM block sits first, so PaliGemma's pretraining distribution is preserved (no future-context leakage). State q_t is its own block so its KV cache survives all 10 flow-matching steps without recomputation. The action block (the noisy chunk A_t^τ) attends to everything earlier — this is how the action expert “reads” VLM features. [π₀ §B, “Attention mask”]
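A sketch of that block-wise causal mask, with illustrative token counts (264 prefix tokens, 1 state token, 50 action tokens); True means “may attend”.

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Full attention within a block; each block also attends to all earlier blocks."""
    total = sum(block_sizes)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for size in block_sizes:
        end = start + size
        mask[start:end, :end] = True   # own block + everything before it
        start = end
    return mask

# [image+text] -> [state] -> [actions]; sizes are illustrative
mask = blockwise_causal_mask([264, 1, 50])
assert mask[0, 263]          # an image token attends to a text token (bidirectional prefix)
assert not mask[263, 264]    # the prefix cannot attend to the state token
assert mask[300, 264]        # an action token attends to the state token
assert mask[270, 310]        # action tokens attend to each other bidirectionally
```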
The right margin tells you which expert each token belongs to. Image and text tokens flow through the PaliGemma weights; state and action tokens flow through the action-expert weights. The mixing happens inside the shared attention layer, not in a separate cross-attention. When an action token computes its query, it uses the action expert's W_Q; when it reads VLM keys/values, those K and V were produced by PaliGemma's W_K and W_V. One attention product, two parameter sets feeding it.
“π₀ is implemented as a single transformer with two sets of weights (also known as experts), where each token is routed to one of the experts; the weights interact only through the transformer's self-attention layers.” [π₀ §B, “Action expert”]
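A single-head, single-layer sketch of that routing. The projection weights, shapes, and token counts are hypothetical; the point is that there is exactly one attention product, fed by two parameter sets depending on which expert owns each token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

head_dim, d_vlm, d_act = 256, 2048, 1024      # PaliGemma width vs. action-expert width

# Hypothetical single-head projections, one set per expert
Wq_vlm, Wk_vlm, Wv_vlm = (np.random.randn(d_vlm, head_dim) * 0.02 for _ in range(3))
Wq_act, Wk_act, Wv_act = (np.random.randn(d_act, head_dim) * 0.02 for _ in range(3))

prefix = np.random.randn(264, d_vlm)          # image + text tokens at VLM width
suffix = np.random.randn(51, d_act)           # state + action tokens at action-expert width

# Each token's Q/K/V is produced by the weights of the expert that owns it
Q = np.concatenate([prefix @ Wq_vlm, suffix @ Wq_act])   # (315, 256)
K = np.concatenate([prefix @ Wk_vlm, suffix @ Wk_act])
V = np.concatenate([prefix @ Wv_vlm, suffix @ Wv_act])

# Block-wise causal mask over [264 prefix | 1 state | 50 action] tokens
mask = np.zeros((315, 315), dtype=bool)
for start, size in [(0, 264), (264, 1), (265, 50)]:
    mask[start:start + size, :start + size] = True

scores = np.where(mask, Q @ K.T / np.sqrt(head_dim), -np.inf)
out = softmax(scores) @ V     # one attention product; action rows now carry VLM keys/values
```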
Sources: π₀ paper §B (mixture-of-experts attention).
D1 — π₀ architecture, one page
The full pipeline: image + text → PaliGemma backbone → action expert → flow head → action chunk.
The two experts share depth (18 layers), head count (8), and head dimension (256), because attention is shared and the heads have to align. They differ in width and MLP dim: PaliGemma is width = 2048, mlp_dim = 16384; the action expert is width = 1024, mlp_dim = 4096. [π₀ §B] The down-sized expert is a deliberate choice. Inference re-runs the action expert once per Euler step (10 times per chunk), so per-step compute matters more than for the VLM, which forwards the prefix once and caches.
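The same numbers as a config sketch (a paraphrase of §B, not a copy of any particular codebase); the attention geometry has to match across experts because the self-attention layer is shared, while the width and MLP dimension are free to differ.

```python
from dataclasses import dataclass

@dataclass
class ExpertConfig:
    width: int       # residual-stream dimension
    depth: int       # number of transformer layers
    mlp_dim: int     # hidden dimension of the feed-forward block
    num_heads: int
    head_dim: int

# Attention geometry (depth, heads, head_dim) is shared; width and mlp_dim are not.
paligemma_expert = ExpertConfig(width=2048, depth=18, mlp_dim=16384, num_heads=8, head_dim=256)
action_expert    = ExpertConfig(width=1024, depth=18, mlp_dim=4096,  num_heads=8, head_dim=256)
```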
State q_t is zero-padded to the largest configuration in the dataset (D = 18: two arms × 7 DOF + 2 grippers + mobile base + torso). Smaller robots get padded; the model always operates on width-18 vectors. [π₀ §V-A] Each noisy action a_t'^τ is embedded via a 3-layer MLP that mixes the action vector with a sinusoidal encoding of the flow-matching timestep τ, so τ arrives per-token through the input MLP, not via AdaLN.
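A minimal sketch of that per-token embedding. The concat-τ-into-the-input-MLP structure is the point from §B; the sinusoidal encoding, the swish nonlinearity, and the exact layer sizes are assumptions made for illustration.

```python
import numpy as np

def sinusoidal_encoding(tau, dim=1024, max_period=10_000.0):
    """Standard sinusoidal embedding of the scalar flow-matching timestep tau (assumed form)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = tau * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def swish(x):
    return x / (1.0 + np.exp(-x))

width, action_dim = 1024, 18          # action-expert width; actions padded to D = 18

# Hypothetical weights for the 3-layer embedding MLP
W1 = np.random.randn(action_dim, width) * 0.02
W2 = np.random.randn(2 * width, width) * 0.02
W3 = np.random.randn(width, width) * 0.02

def embed_action_token(noisy_action, tau):
    """Map one noisy action vector plus timestep tau to an action-expert input token."""
    h = np.concatenate([noisy_action @ W1, sinusoidal_encoding(tau)])  # (2 * width,)
    return swish(h @ W2) @ W3                                          # (width,)

token = embed_action_token(np.random.randn(action_dim), tau=0.3)
print(token.shape)  # (1024,): tau enters per-token here, no AdaLN needed
```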
Sources: π₀ paper §IV, §B (architecture and parameter counts).
F1 — From noise to action chunk
Flow matching denoises a Gaussian sample into a clean action chunk over 10 small ODE steps along a learned vector field.
The training objective is conditional flow matching with the optimal-transport / linear-Gaussian probability path:
L^τ(θ) = E‖ v_θ(A_t^τ, o_t) − (A_t − ε) ‖², with A_t^τ = τ·A_t + (1 − τ)·ε, ε ~ N(0, I). [π₀ §IV]
In one breath: sample noise the shape of an action chunk, then take 10 small Euler steps along a learned vector field to morph it into a real action chunk conditioned on what the robot sees and the language command. This is not diffusion score matching — the target is a constant-velocity straight line between noise and data, not a noise-schedule gradient. Same family, simpler regression target, fewer steps at inference. The integration is forward Euler with δ = 0.1, ten steps. [π₀ §IV, §D]
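Both halves in a short numpy sketch with a stand-in network. The loss target and the Euler update mirror the equations above; everything else (shapes, the dummy model) is illustrative.

```python
import numpy as np

H, D = 50, 18                              # chunk length, padded action dimension

def dummy_policy(noisy_chunk, obs, tau):
    """Stand-in for v_theta(A_t^tau, o_t); the real model is the transformer described above."""
    return np.zeros_like(noisy_chunk)

def flow_matching_loss(actions, obs, tau):
    """Regression onto the constant-velocity target (A_t - eps)."""
    eps = np.random.randn(*actions.shape)              # eps ~ N(0, I)
    noisy = tau * actions + (1.0 - tau) * eps          # A_t^tau on the linear path
    target = actions - eps                             # straight-line velocity, not a score
    pred = dummy_policy(noisy, obs, tau)
    return np.mean((pred - target) ** 2)

def sample_chunk(obs, num_steps=10):
    """10 forward-Euler steps from noise (tau = 0) to a clean chunk (tau = 1)."""
    delta = 1.0 / num_steps                            # delta = 0.1
    chunk = np.random.randn(H, D)                      # A_t^0 ~ N(0, I)
    tau = 0.0
    for _ in range(num_steps):
        chunk = chunk + delta * dummy_policy(chunk, obs, tau)
        tau += delta
    return chunk

loss = flow_matching_loss(np.random.randn(H, D), obs=None, tau=0.3)
actions = sample_chunk(obs=None)           # (50, 18) continuous action chunk
```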
One small idiosyncrasy worth flagging: π₀'s training-time τ sampling is not uniform. They use p(τ) = Beta((s − τ)/s; α=1.5, β=1) with s = 0.999, which emphasizes noisier (lower-τ) timesteps. This is the opposite of Stable Diffusion 3, which emphasizes the middle. The argument: action prediction is harder at high noise because the observation already constrains the answer at low noise, so the gradient should be spent on the hard part. [π₀ §B]
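Sampling from that density is a one-line change of variables; a sketch, assuming the parameterization quoted above.

```python
import numpy as np

def sample_tau(n, s=0.999, alpha=1.5, beta=1.0, rng=None):
    """Draw timesteps with density p(tau) = Beta((s - tau)/s; alpha, beta) on [0, s].

    Change of variables: if u ~ Beta(alpha, beta), then tau = s * (1 - u) has that
    density, which puts more mass on low (noisy) tau.
    """
    rng = rng or np.random.default_rng()
    u = rng.beta(alpha, beta, size=n)
    return s * (1.0 - u)

taus = sample_tau(100_000)
print((taus < 0.5).mean())  # ~0.65: roughly two-thirds of training steps land in the noisier half
```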
Sources: π₀ paper §IV, §B (flow-matching loss, τ-sampling).
CMP — The two VLA families, side by side
Discretize and AR-decode (RT-2, OpenVLA, π₀-FAST) versus keep continuous and flow-match (π₀, π₀.5).
Two design choices for “how does a VLM emit actions.” The discrete branch tokenizes actions and reuses the LM head: predict tokens one at a time. The continuous branch keeps actions as floats and adds a separate head — diffusion or flow matching. π₀ lives on the continuous side. The discrete branch gets its tokenizer on Page 3.
Latency on RTX 4090 with 3 cameras (π₀ paper §D, Table I): SigLIP image encoders 14 ms, prefix forward pass 32 ms, 10 action-expert forward passes 27 ms (~2.7 ms each), network latency 13 ms. Total on-board: 73 ms per chunk. [π₀ §D] The flow-matching cost (27 ms) is not the bottleneck — image encoding plus prefix forward dominate. The autoregressive equivalent (π₀-FAST, Page 3) is closer to ~750 ms per chunk per the FAST paper, ~10× slower at chunk-level latency.
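A quick budget check with those Table I numbers, assuming the 50 Hz control rate reported in the π₀ paper (so a 50-action chunk covers about one second of motion).

```python
# Per-chunk latency budget (ms), using the Table I numbers quoted above
image_encode, prefix_forward, flow_steps = 14, 32, 27
on_board = image_encode + prefix_forward + flow_steps         # 73 ms of on-board compute

chunk_len, control_hz = 50, 50                                # 50 actions at 50 Hz (assumed from the paper)
chunk_duration_ms = 1000 * chunk_len / control_hz             # ~1000 ms of motion per chunk

print(on_board, "ms of compute for", chunk_duration_ms, "ms of actions")
print(f"flow-matching share of on-board latency: {flow_steps / on_board:.0%}")  # ~37%, not the bottleneck
```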
Sources: π₀ paper §D Table I (latency); FAST paper §VI-D (AR latency); fast_deep_dive_2026-04-14.md.