Section 1 · the substrate

What the model is made of.

π₀.5 takes pixels and language as input and emits robot actions as output. Before any of that, both inputs have to be chopped into pieces a transformer can read.

A neural network does not natively understand a kitchen photo or the sentence pick up the sponge. It understands vectors of numbers. The first job of the model is to turn both into the same kind of thing — a flat sequence of vectors we call tokens.

For text, a tokenizer chops a sentence into sub-word units (pick, up, the, spo, nge) and looks each one up in a fixed dictionary. Every entry has a learned vector, typically a few hundred to a few thousand numbers wide (2048 in the backbone used here).
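To make that lookup concrete, here is a toy tokenise-then-embed sketch in numpy. The five-entry vocabulary, the eight-number width, and the random table are stand-ins for illustration; a real tokenizer has hundreds of thousands of entries and the real table is learned, not random.

```python
import numpy as np

# Toy tokenise-then-embed. The vocabulary, the sub-word split, and the 8-wide
# random embeddings are stand-ins; the real table is learned and much larger.
rng = np.random.default_rng(0)
vocab = {"pick": 0, "up": 1, "the": 2, "spo": 3, "nge": 4}   # toy dictionary
d_model = 8                                  # toy width; the backbone uses 2048
embedding_table = rng.standard_normal((len(vocab), d_model))

tokens = ["pick", "up", "the", "spo", "nge"]
ids = [vocab[t] for t in tokens]             # sub-words -> integer ids
text_vectors = embedding_table[ids]          # ids -> one vector per token
print(text_vectors.shape)                    # (5, 8): five sub-word vectors
```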

For images, the trick is grid-and-flatten. A 224×224 photograph is sliced into a 14×14 grid of 16×16-pixel patches; each patch runs through a small network and is squeezed into a single vector of the same width as a text token. A photo becomes 196 vectors. ¹ Both streams concatenate into one long row.

¹ 14 × 14 = 196 patches at the default 224-pixel input. PaliGemma is the open vision-language model π₀.5 starts from.
Figure 1.1 — A photo and a sentence become one sequence
A 14×14 grid of patches over a kitchen scene flattens into 196 patch vectors; a five-word command tokenises into five sub-word vectors; both rows live in the same sequence.
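The grid-and-flatten step above is just a reshape plus one projection. In the sketch below a single random linear map stands in for the real patch network (SigLIP's projection is learned inside a full vision tower), but the shapes are the ones quoted in the text.

```python
import numpy as np

# Grid-and-flatten for one 224x224 RGB photo. A single random linear map stands
# in for the real patch network; the shapes match the description above.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # H x W x RGB
patch, d_model = 16, 2048                    # 16x16 patches, token width 2048

# Slice into a 14x14 grid of 16x16 patches, then flatten each patch.
patches = image.reshape(14, patch, 14, patch, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(14 * 14, patch * patch * 3)        # (196, 768)

# Squeeze every flattened patch into one vector of token width.
patch_projection = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
image_tokens = patches @ patch_projection                    # (196, 2048)
print(image_tokens.shape)
```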

Attention: how a token decides what to listen to.

Once the sequence exists, tokens have to talk to each other. The mechanism is called attention, and it is mechanical.

For every token, the model produces three short vectors from its embedding: a query, a key, and a value. The query is a question (“what kind of word goes after cat?”); the key is a label the token advertises (“I am a verb”); the value is content the token will hand over (“sat”).

To update a token, the model dot-products its query against every token's key (its own included), normalises the scores with softmax so they sum to one, and uses those weights to take a weighted average of the values. That weighted average is the token's new representation.

SCALED DOT-PRODUCT ATTENTION
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

The $\sqrt{d}$, where $d$ is the width of the query and key vectors, keeps dot products from blowing up when vectors get wide; without it, softmax saturates and the model stops learning.
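In code, the whole update is a handful of lines. This is the generic single-head computation, not π₀.5's implementation; the sentence and dimensions are toy-sized.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.
    Q, K, V: (num_tokens, d) arrays; returns one new vector per token."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # every query against every key
    weights = softmax(scores, axis=-1)       # each row sums to one
    return weights @ V                       # weighted average of the values

# Six toy tokens ("the cat sat on the mat"), 16 dimensions per q/k/v.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 16)) for _ in range(3))
print(attention(Q, K, V).shape)              # (6, 16)
```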

Figure 1.2 — A single attention head, one token at a time
Click any word in the cat sat on the mat. Step 1 shows raw dot products between that word's query and every key; step 2 normalises them with softmax; step 3 mixes the values according to those weights.

Multi-head attention: several questions at once.

One attention computation forces every token to ask one question per layer. The fix is to run several in parallel, each on its own slice of the embedding. Each slice is called a head and learns its own query/key/value matrices. One head ends up tracking word order; another tracks which subject a verb belongs to; another binds adjectives to their nouns. The heads' outputs concatenate and pass through one final projection.
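Here is a minimal numpy sketch of that split-compute-concatenate pattern, with toy sizes (four heads over a 64-wide embedding) and random matrices standing in for the learned query/key/value and output projections.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (tokens, d_model); each W: (d_model, d_model). Toy sketch."""
    T, d_model = X.shape
    d_head = d_model // num_heads
    def split(W):                            # project, then slice into heads
        return (X @ W).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)               # (heads, T, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, T, T)
    heads = softmax(scores) @ V                             # one answer per head
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # glue heads back together
    return concat @ Wo                                      # final output projection

rng = np.random.default_rng(0)
T, d_model = 10, 64
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4).shape)   # (10, 64)
```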

Figure 1.3 — Four heads, four reading strategies
Four hand-authored heads on the sentence the red cube is to the left of the blue cube; their outputs concatenate end-to-end and pass through one final linear projection.

A transformer block: attention plus a small per-token network.

A transformer stacks identical blocks. One block does two things in order: multi-head attention, so tokens mix information across the sequence, then a small two-layer network (a feed-forward or MLP) applied to each token independently — this is where most parameters live. Residual connections let information skip past each sublayer; layer norm keeps activation sizes stable as the stack goes deep.
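A minimal pre-norm block looks like this in numpy. To keep the sketch short it uses plain layer norm, single-head attention, and a ReLU feed-forward; the production blocks differ in those details (normalisation variant, gated feed-forward, multiple heads), but the norm-attention-residual, norm-MLP-residual shape is the same.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention, standing in for the multi-head version."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """Pre-norm block: norm, attention, residual add, norm, MLP, residual add."""
    X = X + self_attention(layer_norm(X), Wq, Wk, Wv)   # tokens mix across the sequence
    h = layer_norm(X)
    X = X + np.maximum(h @ W1, 0.0) @ W2                # per-token two-layer network
    return X

rng = np.random.default_rng(0)
T, d, d_ff = 10, 32, 128
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1

# Stacking: the loop reuses one set of weights for brevity; real blocks each
# have their own.
for _ in range(3):
    X = transformer_block(X, Wq, Wk, Wv, W1, W2)
print(X.shape)                                          # (10, 32)
```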

Stack 18 of these blocks and you have the language-model half of PaliGemma — the backbone π₀.5 starts from. ²

² PaliGemma combines a SigLIP vision tower (~400M params) with a Gemma-2B language model. Together: ~3.0B parameters of vision-language model.
Figure 1.4 — One transformer block, animated
A single token's path through one block: norm, attention, residual add, norm, feed-forward, residual add. Eighteen of these stacked is the language backbone of π₀.5.

The mask: who is allowed to look at whom.

Attention defaults to letting every token see every other. For a model generating text one token at a time, that is too much — a token must not peek at words it has not produced yet. The fix is a mask: a 2D table the same size as the attention matrix, with forbidden cells set to $-\infty$ before the softmax (which forces their weights to zero).

π₀.5 uses a pattern called prefix-LM. The input — image patches plus the language prompt — is the prefix; tokens inside the prefix see each other in both directions because the model is reading them, not generating them. Anything after the prefix — the suffix — is causal: each new token sees all the prefix and only the earlier suffix tokens.
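Building that mask takes only a few lines: a boolean table of who may see whom, converted to 0 / −∞ and added to the attention scores before the softmax. The token counts below match Figure 1.5; the construction is the generic one, not π₀.5's actual code.

```python
import numpy as np

def prefix_lm_mask(num_prefix, num_suffix):
    """Additive attention mask: 0 where looking is allowed, -inf where blocked.
    Prefix tokens (image patches + prompt) see each other in both directions;
    suffix tokens see the whole prefix plus only the earlier suffix tokens."""
    n = num_prefix + num_suffix
    allowed = np.zeros((n, n), dtype=bool)
    allowed[:, :num_prefix] = True           # every token may read the prefix
    rows, cols = np.tril_indices(n)          # causal part: key index <= query index
    allowed[rows, cols] = True
    return np.where(allowed, 0.0, -np.inf)   # add to scores before the softmax

# 16 image patches + 5 text tokens form the prefix; 3 generated tokens follow.
mask = prefix_lm_mask(num_prefix=16 + 5, num_suffix=3)
print(mask.shape)                            # (24, 24), matching Figure 1.5
```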

Figure 1.5 — The prefix-LM attention mask
An attention mask for ~24 tokens (16 image patches + 5 text tokens + 3 generated tokens); allowed regions are coloured by role (prefix ↔ prefix is bidirectional, suffix → prefix reads the input, suffix → suffix is causal) and blocked cells are dark.

That is the substrate. Tokens are vectors; attention is a learned weighted average; a block is attention plus a per-token network with residuals; stack the blocks under the prefix-LM mask and you have the vision-language model π₀.5 is built on.

The next page introduces the action expert — a second transformer that talks to this one and produces continuous robot actions instead of words.

One layer's worth of arithmetic, in numbers

In the Gemma-2B language model at the core of π₀.5's backbone, each token is a vector of 2048 numbers. Attention runs with 8 heads, so each head sees 2048 / 8 = 256 dimensions per query/key/value. The feed-forward sublayer expands every token to 16384 and projects back. Eighteen blocks stack on top of each other; the attention mask is shared across blocks but the weights are not. Image patches use the same dimension as text tokens because PaliGemma projects the SigLIP vision-tower outputs into the language model's token space with a single linear map.
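A quick back-of-the-envelope check of those numbers. The counts below assume a plain transformer block at Gemma-2B's published sizes and ignore normalisation weights, the gated feed-forward, and the shared key/value heads, so treat them as rough orders of magnitude showing where the parameters sit.

```python
# Back-of-the-envelope parameter arithmetic for a Gemma-2B-sized stack. Plain
# transformer-block counts only: normalisation weights, the gated feed-forward,
# and shared key/value heads are ignored.
d_model    = 2048        # width of every token vector
num_heads  = 8           # attention heads per layer (assumed from Gemma-2B)
d_ff       = 16384       # feed-forward expansion width
num_blocks = 18

print("dims per head:", d_model // num_heads)                # 256
attn_per_block = 4 * d_model * d_model                       # Wq, Wk, Wv, Wo
mlp_per_block  = 2 * d_model * d_ff                          # expand + project back
total = num_blocks * (attn_per_block + mlp_per_block)
print(f"attention per block:    ~{attn_per_block / 1e6:.0f}M")
print(f"feed-forward per block: ~{mlp_per_block / 1e6:.0f}M")   # most parameters live here
print(f"all {num_blocks} blocks:           ~{total / 1e9:.1f}B")
```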