Where you are. The payoff page. You have the substrate (Page 1), the architecture (Page 2), the surviving tokenizer (Page 3), and the co-training recipe (Page 4). This page assembles them into π₀.5 and asks the honest question: how much of the open-world generalization claim survives PI's own ablations?
π₀.5 keeps π₀'s runtime architecture almost unchanged. It bolts on a discrete FAST-token pretraining objective, a heterogeneous co-training data mix (web VQA, cross-embodiment teleop, ~400 hours of mobile-manipulator demos across 104 homes), a hierarchical “predict-then-act” inference loop where the VLM emits a natural-language subtask before the action expert runs, and the gradient-isolation recipe formalized one month later as Knowledge Insulation. The headline claim is open-world generalization to unseen homes; the honest framing is co-training curriculum, not architecture novelty. [π₀.5 paper, arXiv 2504.16054]
Sources: π₀.5 paper (arXiv 2504.16054); KI paper (arXiv 2505.23705); fast_deep_dive_2026-04-14.md.
The five-axis diff — same architecture, new recipe
Green rows changed; grey rows stayed the same. Read both columns — that's the whole story.
Backbone
π₀: PaliGemma VLM (SigLIP-So400m vision encoder + Gemma-2B language model), fine-tuned end to end.
π₀.5: Same PaliGemma backbone. Protected from the flow-loss gradient by a stop-gradient (formalized in KI a month later). No backbone swap, no architecture novelty.
Action head
π₀: ~300M flow-matching action expert, mixture-of-transformer-experts attention with the VLM. 10 Euler steps at δ=0.1 to denoise a chunk. [π₀ §IV]
π₀.5: Same ~300M flow-matching expert at runtime. Identical Euler-step inference. The runtime is unchanged.
Training objective
π₀: Single objective: flow-matching velocity-prediction loss on continuous action chunks. No FAST-token auxiliary. No stop-gradient. [π₀ §IV]
π₀.5: Two-piece: continuous flow-matching loss on the action expert plus a FAST-token cross-entropy auxiliary on the VLM head. The flow gradient is stop-gradient'd at the VLM boundary. KI later unifies these into one simultaneous loop; a minimal sketch of the two-piece objective follows the table. [π₀.5 §3; KI §3]
Data mixture
π₀: ~10k hours of teleop across 7 robot configs / 68 tasks, plus the OXE Magic Soup subset. [π₀ §V-A]
π₀.5: All of π₀'s data plus: ~400 hours of mobile-manipulator demos across 104 homes; web VQA / OCR / detection; high-level subtask-language annotations (~11% of HL data); more cross-embodiment teleop. Heterogeneous co-training. [π₀.5 §3]
Inference loop
π₀: Flat: the task prompt conditions the action expert directly; no intermediate language step.
π₀.5: Hierarchical “predict-then-act”: the VLM first emits a high-level subtask string (“pick up the sponge”) by autoregressive language decode, then conditions the action expert on that linguistic intermediate to emit the next chunk. Same flow-matching runtime, with a coarse-rate language pass on top. [π₀.5 §3]
Generalization claim
π₀: Strong on in-distribution dexterous tasks (laundry, table bussing, grocery bagging). No claim of unseen-home generalization. [π₀ §V-C]
π₀.5: “Open-world” generalization to 3 unseen homes for long-horizon mobile manipulation. Honest reframe: better than the 3-home pretraining baseline; the eval is narrow; prompt rewording can swing 0→100%. [π₀.5 §4; Penn GRASP independent eval]
Read this as: two grey rows (architecture & runtime — unchanged); four green rows (training, data, inference loop, generalization claim — the actual diff).
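To make the two-piece objective concrete, here is a minimal PyTorch-style sketch of one co-training step. The module names (`vlm`, `action_expert`), the batch keys, and the linear-path flow-matching convention are illustrative assumptions, not openpi's or LeRobot's actual API; the load-bearing details are the two losses and the `detach()` that keeps the flow gradient out of the VLM.

```python
import torch
import torch.nn.functional as F

def ki_training_step(vlm, action_expert, batch):
    """One co-training step: FAST cross-entropy on the VLM, flow matching on
    the action expert, with a stop-gradient at the VLM boundary.
    Hypothetical interfaces:
      vlm(images, prompt_tokens) -> (prefix_features, fast_logits)
      action_expert(noised_chunk, t, prefix_features) -> predicted velocity
    """
    prefix, fast_logits = vlm(batch["images"], batch["prompt_tokens"])

    # (1) Discrete auxiliary: cross-entropy on FAST action tokens.
    #     This is the only loss whose gradient reaches the VLM backbone.
    ce_loss = F.cross_entropy(
        fast_logits.flatten(0, 1),            # [B*T, vocab]
        batch["fast_targets"].flatten(),      # [B*T]
    )

    # (2) Continuous flow matching on the 50x18 action chunk.
    #     One common linear-path convention: a_t = (1 - t) * noise + t * actions,
    #     so the target velocity is (actions - noise).
    actions = batch["actions"]                # [B, H, D]
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)
    a_t = (1 - t) * noise + t * actions
    v_pred = action_expert(a_t, t, prefix.detach())   # detach = stop-gradient
    flow_loss = F.mse_loss(v_pred, actions - noise)

    return flow_loss + ce_loss                # one backward pass trains both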
Data mixture — lab to web, in two bars
π₀ was lab-only. π₀.5 is heterogeneous. Where each slice helps (and where it doesn't, per PI's own ablations).
What π₀ saw: ~10k hours of in-house teleop across 7 robot configurations / 68 tasks, plus the OXE Magic Soup subset; all of it lab-distribution. What π₀.5 added on top: ~400 hours of mobile-manipulator demos across 104 homes, web VQA / OCR / object-detection batches, broader cross-embodiment teleop, and ~11% of the high-level data carrying explicit subtask-language annotations. Every training step sees a heterogeneous batch (a minimal sampler sketch follows the caveats below). Three honest caveats follow from π₀.5's own ablations and from independent evaluations.
Web data is not the active ingredient. PI's own paper reports the “no web-data” ablation is not statistically significant on the main tasks. The dominant generalization levers are cross-embodiment teleop and subtask-language annotations (~11% of HL data), not the web VQA / OCR / detection co-training data — even though “we add web data and generalization emerges” is the easiest-to-pitch reading. Defensible explanation: web data preserves the VLM's language-conditioning capability rather than directly teaching manipulation. [π₀.5 §4 ablation table]
Embodiment diversity matters; data scale alone doesn't. 104-home pretraining vastly outperforms a 3-home baseline on unseen-home generalization. The slope is steep enough that PI uses it as evidence of a scene-/embodiment-diversity scaling law inside this regime. The claim is real against the prior baseline.
The eval is narrow. Three homes. Penn GRASP's independent in-the-wild evaluation reported 0% success on novel glass and unseen backgrounds; prompt rewording can swing results 0→100% on π₀ (and likely π₀.5). LeRobot's pi05_libero weight port hit 0% out of the box until preprocessing was fixed (HF issue #2533). The contribution is real; the “open-world” framing is stronger than the eval supports. [fast_deep_dive_2026-04-14.md §5]
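The sampler sketch promised above: mechanically, a heterogeneous batch just means each batch slot first picks a data source by a fixed mixture weight, then an example from that source. The source names and weights below are placeholders, not π₀.5's published ratios.

```python
import random

# Placeholder sources and weights, not π₀.5's published mixture ratios.
MIXTURE = {
    "mobile_manipulation_homes": 0.25,
    "cross_embodiment_teleop": 0.35,
    "lab_teleop_pi0": 0.20,
    "web_vqa_ocr_detection": 0.15,
    "subtask_language_hl": 0.05,
}

def sample_batch(datasets, batch_size=64, seed=None):
    """Draw one heterogeneous batch: each slot picks a source by weight,
    then an example from that source. `datasets` maps source name -> list."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append((source, rng.choice(datasets[source])))
    return batch
```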
Sources: π₀.5 paper §3 (data mixture), §4 (ablations); Penn GRASP independent eval; fast_deep_dive_2026-04-14.md.
What is not new in π₀.5
The architecture barely moved. The recipe and the data did all the work.
Still a transformer. PaliGemma + action expert with mixture-of-transformer-experts attention. No new backbone family, no new attention variant.
Still flow at runtime. Continuous output, ~10 Euler steps, identical to π₀. FAST tokens are training-only. (A minimal Euler sketch follows this list.)
Still chunked actions. H = 50 timesteps, D = 18 (zero-padded). Half-chunk replanning cadence carried over from π₀.
Still PaliGemma backbone. SigLIP-So400m + Gemma-2B, end-to-end fine-tuned. No swap.
Still the prefix-LM mask. Three-block causal layout from Page 2. The hierarchical predict-then-act layer sits on top of this; the underlying mask is unchanged.
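The Euler sketch promised above: a minimal version of the runtime denoising loop, 10 fixed steps of size δ = 0.1 from Gaussian noise to a 50×18 chunk. The `action_expert` call signature is an assumption, and the integration direction matches the linear noise-to-data path used in the training sketch earlier on this page.

```python
import torch

@torch.no_grad()
def denoise_chunk(action_expert, prefix, horizon=50, action_dim=18, steps=10):
    """Integrate the learned velocity field from noise (t=0) to a clean
    horizon x action_dim chunk (t=1) with fixed Euler steps of size 1/steps."""
    a = torch.randn(1, horizon, action_dim)       # start from pure noise
    dt = 1.0 / steps                              # δ = 0.1 for 10 steps
    for i in range(steps):
        t = torch.full((1, 1, 1), i * dt)
        a = a + dt * action_expert(a, t, prefix)  # Euler update along v(a, t)
    return a
```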
FAQ — five things you didn't know you didn't know
Q. Are π₀-FAST and π₀.5 the same model?
No. π₀-FAST (Jan 2025) is a sister architecture that drops the action expert entirely and lets the VLM autoregressively emit FAST tokens at runtime. It was retired by PI by May 2025 because of latency. π₀.5 keeps π₀'s flow-matching action expert and uses FAST tokens as a training-time auxiliary target only. Different models, same tokenizer, very different fates. [FAST §5; KI paper]
Q. Is the hierarchy a learned latent or natural-language tokens?
Natural-language tokens. The high-level “subtask” bottleneck is a short text emission (“pick up the sponge”) that the same VLM autoregressively decodes before conditioning the action expert. This buys compositional generalization for free (the VLM's language head already knows how to combine concepts) at the cost of an extra autoregressive decode at chunk boundaries. It is the same Hi Robot-style System-2 / System-1 split, applied inside a single VLA. [π₀.5 §3]
Q. Does the high-level decode run on every control step?
No: it runs at chunk boundaries, not at control rate. Concretely, a new subtask is emitted at the start of each chunk (~every 50 timesteps at 50 Hz, so ~1 Hz), or when the prior subtask is detected as complete. The low-level action expert emits a fresh chunk at each boundary. So the rates are tiered: HL subtask decode ~1 Hz, new chunks ~1–2 Hz (the half-chunk replanning cadence carried over from π₀), control loop ~50 Hz with interpolation. [Hi Robot precedent; π₀.5 §3]
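A minimal sketch of that cadence, with the high-level decode gated to chunk boundaries and subtask completion. The helpers `decode_subtask`, `subtask_done`, and `vlm.encode` are hypothetical names for whatever the deployment stack actually exposes; `denoise_chunk` is the Euler sketch from earlier on this page.

```python
def predict_then_act(vlm, action_expert, get_observation, send_action,
                     task_prompt, horizon=50):
    """Hierarchical loop: coarse-rate language pass (~1 Hz), ~50 Hz execution.
    decode_subtask / subtask_done / vlm.encode are hypothetical helpers."""
    subtask = None
    while True:
        obs = get_observation()
        # High-level pass: autoregressive text decode, only at a chunk
        # boundary or when the current subtask is judged complete.
        if subtask is None or subtask_done(vlm, obs, subtask):
            subtask = decode_subtask(vlm, obs, task_prompt)  # e.g. "pick up the sponge"
        # Low-level pass: the flow head emits the next action chunk,
        # conditioned on the subtask string instead of the raw task prompt.
        prefix = vlm.encode(obs, subtask)
        chunk = denoise_chunk(action_expert, prefix, horizon=horizon)
        for action in chunk[0]:          # execute the chunk at control rate
            send_action(action)          # (replanning cadence elided here)
```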
Q. Why is RTC compatible with π₀ / π₀.5 but not π₀-FAST?
Real-Time Chunking (arXiv 2506.07339) treats the seam between consecutive chunks as a diffusion inpainting problem — freeze the tail of the previous chunk as known context, use masked diffusion to inpaint a smooth start to the next. This is mathematically possible only on continuous diffusion / flow heads. Autoregressive token streams have no “partially denoised chunk” for the inpainter to operate on. LeRobot's RTC implementation (PR #1698, Nov 2025) explicitly supports π₀, π₀.5, SmolVLA — not π₀-FAST. So even if AR-FAST inference were fast, RTC would still force a flow head at runtime for any deployment that cares about latency tolerance.
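To see why a continuous head is load-bearing here, the sketch below pins the first few actions of the new chunk to already-committed actions at every Euler step. This is a crude RePaint-style inpainting heuristic for illustration, not RTC's actual soft-masking algorithm; the point is that every line presupposes a partially denoised continuous chunk, which an autoregressive token stream never has.

```python
import torch

@torch.no_grad()
def inpaint_next_chunk(action_expert, prefix, prev_tail, horizon=50, steps=10):
    """Crude RePaint-style sketch, not RTC's algorithm: keep the first
    len(prev_tail) actions of the new chunk pinned to already-committed
    actions while Euler-denoising the rest from noise."""
    k = prev_tail.shape[0]                         # number of frozen timesteps
    noise = torch.randn(1, horizon, prev_tail.shape[1])
    a = noise.clone()
    dt = 1.0 / steps
    for i in range(steps):
        t_val = i * dt
        t = torch.full((1, 1, 1), t_val)
        a = a + dt * action_expert(a, t, prefix)   # free region follows v(a, t)
        # Frozen region: reset to the known actions' position on the
        # noise-to-data path at the next time value.
        a[:, :k] = (1 - (t_val + dt)) * noise[:, :k] + (t_val + dt) * prev_tail
    return a
```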
Q. Is KI uniquely necessary, or could LoRA do the same job?
Unsettled. The KI paper does not include a LoRA-only or adapter-only ablation. VLM2VLA directly rebuts with “simpler LoRA works.” KI's principled argument (a discrete categorical auxiliary target stabilizes a continuous generative backbone, in the spirit of BERT MLM / VQ-VAE commitment loss) is real, but the head-to-head against PEFT recipes has not been run in public. The honest read: KI is a solution; whether it is uniquely necessary is not yet shown.
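For reference, the PEFT alternative in that rebuttal amounts to freezing the backbone and training only low-rank adapters, protecting the VLM's knowledge by construction rather than by a stop-gradient. A generic LoRA wrapper (standard technique, not VLM2VLA's code) looks roughly like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter: frozen base Linear plus a trainable low-rank
    update, scaled by alpha / rank. Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # backbone weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```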
What you actually walked away with
Six months between papers. The architecture barely moved — same backbone, same flow head, same attention layout, same chunked output. What moved was the training recipe (KI: stop-gradient + FAST cross-entropy auxiliary), the data mixture (lab teleop + 104 homes + cross-embodiment + web VQA + subtask language), and the inference loop (a hierarchical language pass on top of an unchanged action expert). The result is a policy that, against the prior 3-home baseline, sometimes works in unseen homes — and an honest reframe that the “open-world” framing is stronger than the eval supports.
One-line landing: π₀.5 = π₀ architecture + Knowledge Insulation co-training + a heterogeneous data mixture. Inference is unchanged. Open-world generalization is a data + recipe result, not an architecture result.
If you read all five pages, you should be able to answer three questions cold: what is in the action chunk and how does it get there (Page 2), what survived from FAST and what didn't (Page 3), and what does “the gradient stops here” mean (Page 4). The rest is data and patience. Back to index →