The rest of this site reuses four ideas without re-introducing them: scaled-dot-product attention, multi-head splitting, the prefix-LM mask, and the mixture-of-transformer-experts pattern. Page 1 sets that vocabulary precisely. If you've trained a transformer recently, scan the math; the prefix-LM viz (M1) is the load-bearing one.
The walk: a tight math bridge → A1 (drag Q/K, watch softmax) → A2 (four heads, four relations) → M1 (causal / bidirectional / prefix-LM — this is π₀'s mask) → D2 (one transformer block tensor-by-tensor, with the MoE hole π₀ fills).
Sources: π₀ paper (arXiv 2410.24164) §B; Vaswani et al. 2017 (canonical scaled-dot-product attention); PaliGemma technical report (prefix-LM masking).
The four formulas everything else assumes
1. Scaled dot-product attention
For one query and one key, the score is sᵢⱼ = qᵢ · kⱼ / √d. The dot product is similarity in ℝ^d: when token i's question vector aligns with token j's answer vector, the score is large. Stack scores into a matrix S = QKᵀ/√d and softmax row-wise:
Attention(Q, K, V) = softmax(QKᵀ/√d) V
Three things to internalize. The dot product is the matchmaking: high qᵢ·kⱼ means token i finds token j relevant. The softmax is per-row: each query's attention weights sum to 1, so attention is a competition among keys, not an absolute score. The √d is a variance fix: at large d, the typical magnitude of q·k grows like √d; without the divisor, softmax saturates and gradients die. It is not "averaging."
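A minimal NumPy sketch of the formula, to make the shapes concrete; toy dimensions, no mask, no batching, illustrative only rather than π₀'s implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K, V: (seq, d) arrays."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # scores s_ij = q_i . k_j / sqrt(d)
    W = np.exp(S - S.max(axis=-1, keepdims=True))  # row-wise softmax,
    W = W / W.sum(axis=-1, keepdims=True)          # numerically stabilized
    return W @ V                                   # each output row is a weighted mix of V rows

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 16)
```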
2. Multi-head splitting
Instead of one attention with width d_model, run h in parallel with head_dim = d_model / h each. The total compute is unchanged — you split the channels, you didn't add capacity. What you bought is specialization: each head has its own (WQ, WK, WV), so different heads learn different relations (positional, syntactic, coreferential). Concatenate their outputs and project through one shared WO: ℝ^d_model → ℝ^d_model to recombine. In the Gemma-2B / π₀ backbone: d_model = 2048, h = 8, head_dim = 256, with multi-query attention (1 KV head shared across all 8 Q heads). [π₀ §B]
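A sketch of the split in NumPy, assuming standard multi-head attention with equal-width heads and no mask (not Gemma's multi-query variant):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model). The h heads slice the channels."""
    seq, d_model = x.shape
    head_dim = d_model // h
    # project, then reshape (seq, d_model) -> (h, seq, head_dim): same total width, h slices
    split = lambda W: (x @ W).reshape(seq, h, head_dim).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)    # (h, seq, seq): one pattern per head
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    heads = w @ v                                            # (h, seq, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)  # re-merge the channel slices
    return concat @ Wo                                       # one shared output projection
```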
3. Causal mask
For autoregressive generation, token t may not see tokens > t. Implementation: add a mask M to the scores before softmax, where Mᵢⱼ = −∞ for j > i and 0 otherwise. Setting an entry to −∞ makes its softmax weight exactly zero. The mask is upper-triangular −∞; the allowed region is the lower triangle (including the diagonal). At inference, this is what lets you cache K and V from prior tokens and compute only the new token's query row (appending one new K/V column) per step — the KV cache.
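The mask and its effect, as a toy NumPy sketch (real implementations often use a large negative constant rather than a literal −∞):

```python
import numpy as np

seq = 6
mask = np.triu(np.full((seq, seq), -np.inf), k=1)   # -inf strictly above the diagonal, 0 elsewhere

scores = np.random.default_rng(0).normal(size=(seq, seq))
masked = scores + mask                               # future positions (j > i) become -inf
w = np.exp(masked - masked.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)
print(np.allclose(np.triu(w, k=1), 0.0))             # True: zero weight on the future
```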
4. Prefix-LM (the bridge to π₀)
PaliGemma's twist, which π₀ inherits: pick a prefix length k. The first k tokens are mutually bidirectional — image patches, language prompt, robot state all see each other. The remaining tokens are causal — each one sees the entire prefix and the suffix tokens up to itself. Mask shape: a fully-white k×k block in the upper-left (the prefix), a lower-triangular block on the suffix, and the off-diagonal block lets every suffix query read every prefix key but never the other way around. This is π₀'s attention pattern, with the suffix being the noisy action chunk denoised over 10 flow-matching steps. [π₀ §B, "Attention mask"]
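A sketch of the prefix-LM mask as an additive matrix, built by opening up the prefix block of a causal mask (hypothetical helper name, toy sizes):

```python
import numpy as np

def prefix_lm_mask(seq, k):
    """0 where attention is allowed, -inf where blocked. First k tokens: bidirectional."""
    mask = np.triu(np.full((seq, seq), -np.inf), k=1)  # start from the causal mask
    mask[:k, :k] = 0.0                                  # prefix block: everyone sees everyone
    return mask                                         # suffix rows already see the whole prefix

print(prefix_lm_mask(6, 3))
```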
5. Mixture-of-transformer-experts (a preview, not Page 1's job)
π₀ is one transformer with two sets of weights ("experts"): a PaliGemma-initialized expert for image and language tokens, an action-expert for state and noisy-action tokens. The two experts share only the attention computation — they have separate Q/K/V/O projection weights and separate MLPs, but inside each layer the K's and V's from both experts are concatenated and every Q sees every allowed K. The routing is deterministic by token type, not learned gating. [π₀ §B, "Action expert"] Page 2 unpacks this; for now just notice that the diagram fits inside one transformer block.
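To see what "share only the attention computation" means mechanically, here is a toy sketch: a single head, no mask, and (unlike the real model) equal widths for both experts; routing is a boolean per token, not a learned gate:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def two_expert_attention(x, is_action, vlm, act):
    """x: (seq, d). is_action: (seq,) bool. vlm/act: dicts with their own Wq, Wk, Wv, Wo."""
    d = x.shape[-1]
    route = lambda name: np.where(is_action[:, None], x @ act[name], x @ vlm[name])
    q, k, v = route("Wq"), route("Wk"), route("Wv")   # per-token expert projections...
    w = softmax(q @ k.T / np.sqrt(d))                 # ...but one joint attention over all tokens
    mixed = w @ v
    return np.where(is_action[:, None], mixed @ act["Wo"], mixed @ vlm["Wo"])
```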
A1 — Attention as Q/K/V matchmaking
Drag any K. Watch the heatmap row redistribute via softmax. Click a preset to see three canonical attention patterns.
Eight toy tokens on a 1-D line. Each carries a Q arrow, a K arrow, and a colored V swatch. The middle panel is the 8×8 attention heatmap: cell (i, j) is query i's softmax weight on key j, i.e. row i of softmax(qᵢ·kⱼ/√d). The bottom row is each token's mixed-V color — the layer's output. Drag a K, watch its column recolor and the output row blend. Toggle "pre-softmax" to see raw q·k/√d: the diverging map shows the softmax doing a lot of work, turning roughly-equal raw scores into a sharp competition.
The three presets snap K positions to canonical patterns. Self-loop heavy: Kᵢ=Qᵢ → the diagonal lights up. Next-token: Kᵢ=Qᵢ₋₁ → the super-diagonal lights up, each token attending to the one after it (this is what positional heads do). Global mix: K's are tiny → q·k is small everywhere → softmax flattens into a near-uniform row.
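The three presets are easy to reproduce offline. A sketch under the same toy setup (8 tokens, random Q; the viz geometry itself is not modeled):

```python
import numpy as np

def heatmap(Q, K):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    W = np.exp(S - S.max(-1, keepdims=True))
    return W / W.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))

diag = heatmap(Q, Q)                        # self-loop heavy: K_i = Q_i, diagonal dominates
shift = heatmap(Q, np.roll(Q, 1, axis=0))   # next-token: K_i = Q_{i-1}, mass lands at j = i+1 (wraps at the end)
flat = heatmap(Q, 0.01 * rng.normal(size=(8, 16)))  # global mix: tiny K, near-uniform rows
print(np.round(flat[0], 2))                 # roughly 0.125 everywhere
```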
Sources: Vaswani et al. 2017 (canonical scaled-dot-product attention).
A2 — Why multiple heads
Different heads attend to different relations. Total dim is fixed; heads slice it.
The fixed prompt is "the red cube is to the left of the blue cube." Four heads, hand-authored to make the point: one is positional (each token mostly attends to its neighbor), one is syntactic (verbs find their subjects), one is color-coreference ("red" binds to its "cube", "blue" binds to its "cube"), one is object-coreference (the two "cube" tokens find each other). Click a head → see only its pattern. Click a token → see all four heads light up where that token attends.
Two myths to puncture. Myth one: "more heads = more capacity." Heads slice a fixed total dim into h equal subspaces — with d_model = 2048 and h = 8 (the π₀ backbone), each head is head_dim = 2048 / 8 = 256. [π₀ §B] The capacity gain is in specialization — different subspaces capture different relations — not raw parameters. Myth two: "heads are independent layers." They are parallel projections within one layer, concatenated and projected back to d_model by a single output matrix WO.
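Myth one is quick to check with arithmetic: the Q/K/V/O parameter count is the same for any head count that divides d_model (toy count; ignores biases and Gemma's multi-query sharing):

```python
d_model = 2048
for h in (1, 8, 16):
    head_dim = d_model // h
    qkvo_params = 4 * d_model * (h * head_dim)   # = 4 * d_model**2 regardless of h
    print(f"h={h:2d}  head_dim={head_dim:4d}  qkvo_params={qkvo_params}")
```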
Sources: Vaswani et al. 2017; π₀ paper §B (Gemma-2B head config).
M1 — The triangular veil and the prefix-LM exception
Causal mask = upper triangle set to −∞. Toggle to prefix-LM to see π₀'s exact attention pattern — the climax of Page 1.
Three modes. Causal: position t may attend only to positions ≤ t. The mask is added before softmax, not after — setting an entry to −∞ is exactly what makes its softmax weight zero. Bidirectional: every position sees every other (BERT, encoders). Prefix-LM: the first k tokens are mutually bidirectional; every later token sees the full prefix and attends causally within the suffix. PaliGemma uses prefix-LM so image patches and the language prompt all see each other (they are the question), while the generated suffix is causal (it is the answer being decoded one token at a time).
This is exactly the pattern π₀ reuses, with a twist we'll meet on Page 2: three blocks [image+text] → [state] → [action], bidirectional within each block, causal between them. [π₀ §B, "Attention mask"] The action chunk attends to everything earlier; the VLM prefix never attends forward into actions; the state token caches once across all 10 flow-matching steps. Toggle prefix-LM mode in the viz — the orange overlay is the prefix block; the white square in the lower-right is the suffix's causal triangle.
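Page 2 draws this three-block mask in full; as a preview, a sketch of how a blockwise-causal mask can be built (hypothetical helper name, toy block sizes):

```python
import numpy as np

def block_causal_mask(n_prefix, n_state, n_action):
    """Bidirectional within each block, causal between blocks. 0 = allowed, -inf = blocked."""
    sizes = [n_prefix, n_state, n_action]              # [image+text], [state], [action chunk]
    block_id = np.repeat(np.arange(len(sizes)), sizes)
    allowed = block_id[:, None] >= block_id[None, :]   # a query sees its own block and earlier ones
    return np.where(allowed, 0.0, -np.inf)

print(block_causal_mask(4, 1, 3))
```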
Sources: π₀ paper §B (block-causal attention mask); PaliGemma technical report.
D2 — One transformer block, tensor by tensor
What lives where. Scrub through the eleven sub-ops of a single decoder block, then notice where π₀ inserts a second expert.
One decoder block, in order: input → RMSNorm → Q/K/V projections → attention → attention-output projection → residual add → RMSNorm → MLP up-projection → activation → MLP down-projection → residual add. The residual stream is the "shared bus" every sub-layer reads from and writes to. Norms keep activations stable; the MLP does most of the per-token compute and holds most of the parameters — mlp_dim = 16384 in Gemma-2B vs width = 2048. [π₀ §B]
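The same dataflow as a NumPy sketch: single head, plain GELU MLP, toy weights in a dict (the real Gemma block adds a gated MLP, rotary positions, and multi-query attention):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps) * gamma

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def decoder_block(x, p, mask):
    """x: (seq, width). p: dict of weights. One block reading and writing the residual 'bus'."""
    # attention sub-layer: norm -> Q/K/V -> attention -> output projection -> residual add
    h = rms_norm(x, p["norm1"])
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v
    x = x + attn @ p["Wo"]
    # MLP sub-layer: norm -> up-projection -> activation -> down-projection -> residual add
    h = rms_norm(x, p["norm2"])
    x = x + gelu(h @ p["W_up"]) @ p["W_down"]
    return x
```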
This is also the unit where π₀'s mixture-of-experts lives. The action expert is not a separate transformer with cross-attention to the VLM — it is a second set of weights inside this same block, with token-wise routing. Image and text tokens hit the big PaliGemma MLP weights (width=2048, mlp_dim=16384); state and action tokens hit a smaller action-expert MLP (width=1024, mlp_dim=4096). Both kinds of tokens flow through the same attention computation, mixing inside one Q·K·V product. This is structurally a Shazeer-style mixture of experts with deterministic per-modality routing instead of a learned gate. [π₀ §B, "Action expert"] Page 2 unpacks this; for now, just notice that the action expert lives inside this block diagram, not next to it.
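A toy sketch of that deterministic routing for the MLP half of the block, assuming (unlike π₀, where the two experts also differ in stream width) both experts read the same residual stream:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def routed_mlp(x, is_action, vlm_up, vlm_down, act_up, act_down):
    """x: (seq, width). Image/text rows take the big MLP, state/action rows take the small one."""
    big = gelu(x @ vlm_up) @ vlm_down     # e.g. width -> 16384 -> width
    small = gelu(x @ act_up) @ act_down   # e.g. width -> 4096  -> width
    return np.where(is_action[:, None], small, big)   # routing by token type, not a learned gate
```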
Sources: π₀ paper §B (Gemma-2B configuration; "Action expert").
Bridge to Page 2
Four artifacts now in your head: scaled-dot-product attention, the multi-head split, the prefix-LM mask, and a one-block dataflow with a hole where the MLP can be split into two experts. Page 2 wires them together. A VLM is this stack with image patches concatenated to the prompt under a prefix-LM mask. π₀ adds a second expert — same attention block, separate MLP weights — for continuous actions instead of language; the action expert reads the cached VLM prefix through shared self-attention and emits a velocity field integrated over 10 Euler steps.