Two losses, two paths.
π₀.5 trains by routing a continuous loss into the action expert and a discrete loss into the language model — and refusing, deliberately, to let either gradient cross over.
So far this has been a story about runtime. Training is a different shape. At training time the same network sees several losses at once: the action expert is hit with a flow-matching loss against a real demonstration chunk, and the vision-language model is hit with ordinary language modelling on web text, image captioning, and visual question-answering. On top of those, there is one extra loss that is the whole point of this page.
If you wired this naively — one network, all losses summed, gradients allowed to flow wherever they want — the system breaks.
The pathology.
Flow matching predicts a velocity at a noise level $t \in [0,1]$. The variance of that target depends on $t$: near $t \approx 0$ the chunk is almost pure noise and the velocity needs to do nearly all of the work; near $t \approx 1$ the chunk is almost the clean answer and the velocity is small but sensitive. Across the whole schedule the gradient signal is non-stationary — spiky and badly behaved.¹
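For concreteness, here is the standard linear-interpolation setup (an assumption about the exact schedule; the code sketch at the bottom of the page uses the same velocity target):

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\text{flow}} = \mathbb{E}\,\big\lVert v_\theta(x_t, t) - (x_1 - x_0) \big\rVert^2,$$

with $x_0$ pure noise and $x_1$ the demonstration chunk. The network sees only $x_t$ and $t$; how much of the target $x_1 - x_0$ it can recover from $x_t$ swings with $t$, and that swing is the non-stationarity.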
¹ The variance pathology was first observed in earlier end-to-end vision-language-action recipes; the Knowledge Insulation paper (arXiv 2505.23705, §1) names it as the motivation for separating the two gradients.

If those gradients propagate back into a vision-language model carefully pre-trained on the open internet, they slowly corrupt its language priors. The model still moves the robot, but its instruction-following degrades and its captioning drifts. You have spent a lot of compute teaching a language model how to be a slightly worse language model.
The two-piece fix.
One. A stop-gradient sits at the boundary between the action expert and the VLM. In the forward pass the action expert reads VLM features as if nothing has changed; in the backward pass the gradient is detached — literally, in code, an operation that copies activations forward but returns zero on the way back.² The flow loss never reaches the VLM.
² In PyTorch the operation is x.detach(); in JAX it is jax.lax.stop_gradient(x). The forward output is identical to x; the gradient through it is zero.
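A toy check of that claim (PyTorch, not π₀.5 code):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = 3 * x.detach() + x   # the detached branch is invisible to autograd
y.backward()
print(x.grad)            # tensor(1.): only the non-detached path contributes
```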
Two. A second, well-behaved loss is added on the VLM side. The same action chunk that the flow expert regresses toward is also passed through a tokenizer called FAST — explained next — which compresses it into roughly thirty to sixty discrete integer tokens. A small auxiliary head hangs off the VLM and predicts those tokens via categorical cross-entropy. This loss is allowed to update the VLM. Cross-entropy on a discrete target has stationary, well-conditioned gradients; the language priors take it in stride.
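In sketch form, assuming hypothetical sizes (a 2048-wide trunk, a 2048-token FAST vocabulary, 48 action tokens per chunk; none of these numbers are from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, fast_vocab, n_tokens = 2048, 2048, 48   # assumed sizes; the text says ~1-2k vocab
fast_aux_head = nn.Linear(d_model, fast_vocab)   # small head hanging off the VLM trunk

vlm_features = torch.randn(4, n_tokens, d_model)            # stand-in for VLM outputs
fast_tokens = torch.randint(0, fast_vocab, (4, n_tokens))   # FAST ids for the same chunk

logits = fast_aux_head(vlm_features)                         # [B, T, vocab]
loss_fast = F.cross_entropy(logits.flatten(0, 1), fast_tokens.flatten())
# This loss is NOT detached, so its gradient updates the VLM.
```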
That is the asymmetry. Flow gradient: action expert only. FAST cross-entropy gradient: VLM. Two gradients, two paths, by construction.
What FAST actually does.
The FAST tokenizer takes the same continuous chunk — fifty timesteps of an eighteen-dimensional action vector — and runs it through a four-step pipeline. First, a discrete cosine transform along the time axis turns each joint's trajectory into frequency-domain coefficients; for smooth motor traces almost all the energy lives in the lowest dozen frequencies. Second, those coefficients get uniformly quantized into a small set of integer bins. Third, a byte-pair-encoding pass merges the most common adjacent pairs in the integer stream, the same idea that subword text tokenizers use, exploiting the fact that motor primitives repeat. Fourth, the result is a short sequence of token ids drawn from a vocabulary of about a thousand to two thousand entries.³
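A toy rendition of those four steps (bin counts, coefficient ranges, and the empty merge table are illustrative placeholders, not the published FAST values):

```python
import numpy as np
from scipy.fft import dct

def fast_tokenize_sketch(chunk, n_coeffs=12, n_bins=256, merges=None):
    """chunk: [T, D] continuous actions, e.g. T=50 timesteps, D=18 dims."""
    # 1. DCT along time: each joint's trajectory becomes frequency coefficients;
    #    smooth motion concentrates energy in the lowest few, so keep only those.
    coeffs = dct(chunk, axis=0, norm="ortho")[:n_coeffs]
    # 2. Uniform quantization of coefficients into integer bins.
    lo, hi = coeffs.min(), coeffs.max()              # illustrative range; FAST fixes a scale
    ids = np.clip(((coeffs - lo) / (hi - lo + 1e-8) * n_bins).astype(int), 0, n_bins - 1)
    stream = ids.flatten().tolist()
    # 3. Byte-pair encoding: repeatedly replace learned adjacent pairs
    #    (a merge table trained offline) with new token ids.
    merges = merges or {}                            # {(id_a, id_b): new_id}
    changed = True
    while changed:
        changed, out, i = False, [], 0
        while i < len(stream):
            pair = tuple(stream[i : i + 2])
            if pair in merges:
                out.append(merges[pair]); i += 2; changed = True
            else:
                out.append(stream[i]); i += 1
        stream = out
    # 4. Result: a short integer sequence from a ~1-2k vocabulary.
    return stream

tokens = fast_tokenize_sketch(np.random.randn(50, 18))
```

With a real merge table, fifty timesteps of eighteen joints compress to the thirty-to-sixty tokens quoted above.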
³ FAST is a published tokenizer (Pertsch et al., arXiv 2501.09747) originally proposed for autoregressive vision-language-action models; in π₀.5 it is repurposed as an auxiliary objective during training only. The tokenizer itself was pre-trained on roughly a million robot trajectories, which is why it generalizes across embodiments.

The discrete tokens are only a training target. At runtime the model never produces them — the action chunk is generated by flow matching.⁴ FAST exists in this architecture because cross-entropy on a discrete target is the well-behaved teacher the VLM needs, the same family of trick that BERT's masked language modelling and MaskGIT use in adjacent settings.
⁴ Pertsch (FAST first author), April 2025: "discrete FAST action tokenization now powers all of our pre-training in π0.5". The tokens are the pre-training signal, not the runtime output.

The thesis from the index page, in mechanism: π₀.5 uses a discrete categorical objective to shape the language model during training, and a continuous flow-matching objective to act at runtime. The recipe makes π₀.5 trainable. But trainable for what?
Three or four loss terms, summed; the partition of who-can-update-what comes from the stop-gradient, not from the equation itself. Page 6 turns to what this recipe actually buys — and what it doesn't.
One-line code sketch of the gradient routing
```python
import torch.nn.functional as F

vlm_features = vlm(image, text)                      # [B, T_vlm, D]

# Discrete path: gradients flow into the VLM.
fast_logits = fast_aux_head(vlm_features)
loss_fast = F.cross_entropy(fast_logits.flatten(0, 1), fast_tokens.flatten())

# Continuous path: gradients stop here.
vlm_for_expert = vlm_features.detach()               # ← the load-bearing line
v_pred = action_expert(vlm_for_expert, x_t, t)
loss_flow = F.mse_loss(v_pred, x1 - x0)

loss = lambda_flow * loss_flow + lambda_fast * loss_fast + loss_lm  # + ...
```
In code, knowledge insulation (KI) is one extra line: a stop-gradient on the VLM features before they enter the action expert. Everything else falls out of that.
Is the FAST head doing the real work, or is the stop-gradient enough?
Per the KI paper's own ablations, removing the FAST cross-entropy auxiliary loss drops out-of-distribution performance more than removing either the stop-gradient or the heterogeneous co-training data. The stop-gradient on its own is a safety belt — it prevents corruption — but it does not by itself give the VLM an action-relevant signal. The FAST head is what teaches the VLM that some image-text inputs imply action structure.
Open question: KI ablates the choice of FAST, not the choice of discrete target. A learned VQ-VAE codebook, a wavelet quantization, or a simple binned representation might do equivalent variance-reduction work. The proven claim is "a discrete auxiliary target stabilizes the continuous generative backbone"; whether DCT-BPE specifically is the right discrete target is unsettled.