Where you are. This is the climax. Pages 2 and 3 set up the problem: a continuous flow head needs to coexist with discrete language priors on the same backbone, and a noisy flow gradient backpropagated into a 3B VLM erodes those priors. KI is the answer to that.

Knowledge Insulation does not mean what the name might suggest. The VLM is not frozen. It trains. What's insulated is the path back from the flow-matching loss — that gradient is detached at the boundary, so the action expert's high-variance signal can't corrupt the VLM's language priors. A FAST cross-entropy head provides clean discrete supervision in its place. The VLM sees clean signal from three sources (language data, web VQA, FAST-CE on the action chunk); the action expert sees the noisy flow loss and keeps it to itself. [KI paper, arXiv 2505.23705]

Sources: KI paper (arXiv 2505.23705); π₀.5 paper (arXiv 2504.16054); Pertsch (PI), X/Twitter, April 2025.

VLM (PaliGemma): trained

Receives gradients from (1) the language / web data losses and (2) the FAST-token cross-entropy auxiliary head. Does not receive gradients from the flow-matching loss. The VLM is not frozen — it is selectively shielded.

FAST-CE auxiliary head: trained

A discrete categorical head sitting on the VLM. Predicts the FAST-tokenized version of the same action chunk the flow expert is regressing toward. Cross-entropy loss; gradients do flow back into the VLM.

Action expert (flow head): trained

Reads VLM features through a stop-gradient. Updates its own weights from the flow-matching loss. Its gradient never crosses back into the VLM. Runtime output is continuous flow — FAST tokens are NEVER decoded at inference.

KI1 — Where gradients flow, and where they don't

Toggle pretrain π₀ / finetune π₀ / KI π₀.5 train / KI π₀.5 inference. Watch the backward arrows recolor and disappear.

Why is the asymmetry necessary? The flow-matching loss has time-dependent variance: it spikes near τ=0 (predicting the mean is ill-defined when noise dominates) and near τ=1 (the residual is small but high-frequency). Backpropagating that high-variance, non-stationary gradient into the VLM erodes its language priors during co-training. The VLM is the crown jewel. The training recipe exists to protect it.
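
To see where that τ-dependence enters, here is the flow-matching objective in one common noise-to-action convention (a sketch consistent with the setup described above; the exact interpolation, target sign, and τ-sampling distribution vary across write-ups):

$$
A^{\tau} = \tau A + (1-\tau)\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}_{\text{flow}}(\theta) \;=\; \mathbb{E}_{\tau,\,\varepsilon}\,\bigl\lVert\, v_\theta(A^{\tau}, o, \tau) - (A - \varepsilon) \,\bigr\rVert^{2}
$$

The input A^τ sweeps from pure noise (τ = 0) to the clean action chunk (τ = 1), so the difficulty of the regression, and the variance of its per-sample gradient, shifts with τ rather than staying stationary.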

KI's two-piece insulation:

  1. Stop-gradient at the VLM → action-expert interface. The action expert reads VLM features (forward pass: clean activations flow). The action expert's flow loss propagates inside the action expert's own weights but is detached at the boundary — it does not flow into the VLM. Concretely: a stop_gradient() wraps the VLM features the action expert consumes.
  2. FAST-token cross-entropy auxiliary head, hung off the VLM. The same action chunk the flow expert regresses against is also FAST-tokenized; an auxiliary head on the VLM predicts those discrete tokens via standard next-token CE. This gradient does flow back into the VLM. So the VLM still receives an action-relevant learning signal — but as a clean, stationary, low-variance categorical signal, not a noisy velocity-field signal. (A minimal sketch of both pieces follows this list.)
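
In code, the whole mechanism is a few lines. The sketch below is a minimal JAX rendering of that routing, not the actual π₀.5 codebase: vlm_apply and expert_apply are hypothetical stand-ins for the VLM forward pass (prefix features, FAST-token logits, language logits) and for the flow head, and the relative loss weights are omitted.

```python
# Minimal KI gradient-routing sketch (hypothetical helper names, not the real API).
import jax
import jax.numpy as jnp
import optax

def ki_loss(params, batch, rng):
    # VLM forward pass: features for the action expert, FAST-token logits, language logits.
    feats, fast_logits, lm_logits = vlm_apply(params["vlm"], batch["obs"], batch["prompt"])

    # Clean discrete supervision that DOES reach the VLM: language/web CE + FAST-token CE.
    lm_ce = optax.softmax_cross_entropy_with_integer_labels(lm_logits, batch["text_targets"]).mean()
    fast_ce = optax.softmax_cross_entropy_with_integer_labels(fast_logits, batch["fast_tokens"]).mean()

    # The insulation: the action expert reads VLM features through a stop-gradient,
    # so the flow-matching loss below cannot propagate into params["vlm"].
    feats_detached = jax.lax.stop_gradient(feats)

    # Flow-matching loss on the continuous action chunk (one common convention:
    # interpolate noise -> actions, regress the velocity target actions - noise).
    rng_tau, rng_eps = jax.random.split(rng)
    tau = jax.random.uniform(rng_tau, (batch["actions"].shape[0], 1, 1))
    noise = jax.random.normal(rng_eps, batch["actions"].shape)
    a_tau = tau * batch["actions"] + (1.0 - tau) * noise
    v_pred = expert_apply(params["expert"], feats_detached, a_tau, tau)
    flow = jnp.mean((v_pred - (batch["actions"] - noise)) ** 2)

    # Unweighted sum for clarity; real recipes weight the terms.
    return lm_ce + fast_ce + flow
```

The load-bearing line is the stop_gradient: differentiate this loss and params["vlm"] sees only lm_ce and fast_ce, while params["expert"] is driven entirely by flow. That is the asymmetry the toggle in KI1 animates.
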
“Training with discrete FAST action tokenization now powers all of our pre-training in π0.5!” — Karl Pertsch (PI / FAST first author), X / Twitter, April 2025

This is precisely why the Pertsch quote matters: pre-training, not inference. The VLM is co-trained on a heterogeneous mixture (language, web VQA, object detection, FAST-CE on actions), all of which contribute clean gradients. The action expert is co-trained on flow matching, isolated. Two gradients, one backbone, one direction of leakage. [KI §3]

Sources: KI paper §3 (arXiv 2505.23705); fast_deep_dive_2026-04-14.md §4.

TI1 — Train-time vs inference-time data flow

Same network, two paths. At training, both heads fire. At inference, only the flow head emits.

The four common misreadings, in order of frequency:

  • “FAST is the inference path.” No. At inference, only the flow-matching action expert runs. The FAST auxiliary head is dead weight at deployment — only its representational shaping of the VLM during training carries forward.
  • “KI = freeze the VLM.” No. The VLM is updated by language losses, web data, and FAST-CE. It is not updated by the flow-matching loss. The selective protection is the trick.
  • “FAST tokens are decoded then converted to actions at inference.” No. There is no FAST decoding step at inference in π₀.5 / KI. Continuous flow is the runtime artifact (see the sketch after this list).
  • “The action expert and VLM are separate transformers.” No (carry-over from Page 2). They are two parameter sets inside one transformer. Stop-gradient is applied at the VLM features the action-expert tokens read — one tensor cut, not a structural separation.
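
What the deployment loop actually does is easy to state in the same sketch vocabulary as above (vlm_apply / expert_apply remain hypothetical stand-ins; the step count, horizon, and action dimension are illustrative defaults, not π₀.5's real values):

```python
# Inference-time sketch: only the flow head runs; nothing ever touches the FAST vocabulary.
import jax
import jax.numpy as jnp

def sample_actions(params, obs, prompt, rng, num_steps=10, horizon=50, action_dim=32):
    # One VLM forward pass for the prefix features; the FAST-head logits are simply unused.
    feats, _fast_logits, _lm_logits = vlm_apply(params["vlm"], obs, prompt)

    # Start from Gaussian noise and integrate the learned velocity field from tau = 0 to 1.
    a = jax.random.normal(rng, (1, horizon, action_dim))
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = jnp.full((1, 1, 1), i * dt)
        a = a + dt * expert_apply(params["expert"], feats, a, tau)  # Euler step toward the chunk
    return a  # continuous action chunk out; no FAST decoding anywhere in the loop
```

In this sketch the auxiliary head's output is computed and ignored; in a deployment build it can be skipped outright. The FAST pathway's only lasting contribution is the representations it shaped in the VLM during training.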

One honest open question. KI's ablations remove the FAST-token aux loss, the stop-gradient, or the co-training data. The biggest performance drop comes from removing the aux loss, suggesting the discrete categorical target is doing more representational work than the stop-gradient itself. The KI paper does not ablate against LoRA-only or adapter-only fine-tuning. VLM2VLA directly rebuts with “simpler LoRA works.” Whether KI is uniquely necessary or just one valid solution is unsettled in the public literature. [fast_deep_dive_2026-04-14.md §4.4; pi_papers_multiagent_discussion_2026-04-14.md Round 2]

Sources: KI paper §3, §4 (arXiv 2505.23705); π₀.5 paper §3 (arXiv 2504.16054).

One sentence to leave with. The same network sees three loss signals. Two of them train the VLM (language / web data, plus FAST cross-entropy on the action chunk). One trains the action expert (flow matching). The flow gradient never crosses. That is the whole mechanism — one tensor cut at one boundary, plus a clean discrete supervisor on the VLM side. Everything Page 5 says about co-training, data mixtures, and open-world generalization rests on this single asymmetry.