Section 6 · what the recipe buys

What generalizes, what doesn't.

π₀.5 is trained on a deliberate mixture of five data streams. Two of them do the work. Two are supportive fuel. One — the most loudly advertised — does almost nothing.

The training batch in π₀.5 is heterogeneous on purpose. A single update step pulls examples from five sources, weighted into roughly the proportions shown below. The architecture treats them uniformly — same forward pass, same loss machinery from the previous page — but their effects on what the deployed model can do are wildly unequal.

Figure 6.1 — Five data streams, three ablation verdicts (from π₀.5 §5). Legend: active lever (removing it significantly hurts OOD performance); support (sample-efficient backbone, but not a novel lever); not significant (the ablation does not move the metric).
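As a concrete picture of what "heterogeneous on purpose" means in the data loader, here is a minimal sketch of a weighted mixture sampler. The stream names and weights are illustrative placeholders, not the paper's actual proportions.

```python
import random

# Illustrative mixture weights. These are placeholders for the five streams,
# not the actual proportions reported for π₀.5.
MIXTURE = {
    "lab_teleop":         0.30,
    "mobile_manip_homes": 0.25,
    "cross_embodiment":   0.25,
    "web_vqa_ocr_det":    0.10,
    "subtask_language":   0.10,
}

def sample_batch(datasets, batch_size=256, weights=MIXTURE, seed=0):
    """Draw one heterogeneous batch: each slot independently picks a source
    stream according to the mixture weights, then a random example from it."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append((source, rng.choice(datasets[source])))
    return batch

# Usage with stand-in datasets: every update step then sees a mix of streams.
datasets = {name: [f"{name}_example_{i}" for i in range(100)] for name in MIXTURE}
batch = sample_batch(datasets)
```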

Cross-embodiment data — demonstrations collected on multiple robot platforms in a roughly OXE-style format — is an active lever. Removing it drops unseen-home performance significantly. Same with subtask-language annotations: the roughly eleven percent of the high-level data labelled with short natural-language strings ("pick up the sponge", "open the cabinet") is what lets the hierarchical reasoner emit useful subtask plans. Pulling either of those out of the recipe collapses the result.

The ~400 hours of mobile-manipulator data collected across 104 homes — the headline ingredient that gets the most attention — is supportive but not the novel lever. It is the in-domain fuel; without it the model has nothing to imitate at deployment time. With it in place, the lever-pulls above are what actually drive generalization. Web VQA, OCR, and detection — the kind of data PaliGemma was originally trained on, the kind that suggests "internet-scale knowledge transfers to robots" — makes, by π₀.5's own ablations, no statistically significant difference on the main tasks. This is the surprise. Anyone who frames π₀.5 as "add web data and generalization emerges" is wrong about π₀.5.¹

¹ π₀.5 paper, Black et al., arXiv:2504.16054, §5: "no WD ablation is not statistically significant" on the main tasks. Authors are explicit about this in the paper, less so in the marketing.
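For readers who want "not statistically significant" made concrete: a common way to reach such a verdict is a bootstrap confidence interval on the success-rate difference between the full recipe and the ablated run. The sketch below uses invented episode outcomes; the paper's own statistical test may differ.

```python
import random

def bootstrap_delta_ci(full, ablated, n_boot=10_000, seed=0):
    """95% bootstrap CI on the success-rate difference between two runs,
    each given as a list of per-episode 0/1 outcomes."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        f = [rng.choice(full) for _ in full]
        a = [rng.choice(ablated) for _ in ablated]
        deltas.append(sum(f) / len(f) - sum(a) / len(a))
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Invented outcomes, purely for illustration. If the interval on the drop
# straddles zero, the ablation verdict reads "not statistically significant".
full_recipe = [1] * 70 + [0] * 30   # 70% success over 100 episodes
no_web_data = [1] * 66 + [0] * 34   # 66% success over 100 episodes
print(bootstrap_delta_ci(full_recipe, no_web_data))
```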

Sized correctly, the result is real.

The headline experimental setup is straightforward to state: pre-train on the mixture above, evaluate in three unseen homes — kitchens and bedrooms the model has never seen during training — on long-horizon language-conditioned tasks (clean the kitchen, make the bed, put the laundry away). π₀.5 reports non-zero success on these tasks, with the model invoking its own subtask plans and recovering from intermediate failures.

That is a real result. The prior baseline of "drop a vision-language-action model into a kitchen it has never seen" is essentially zero. The contribution is genuine. It is also not zero-shot generalization to arbitrary settings.
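The "invoking its own subtask plans" part is the hierarchical loop already mentioned above: the backbone first emits a short subtask string, then the action expert executes it as a chunk of continuous actions. The sketch below is schematic; the method and attribute names are hypothetical, not the released API.

```python
def run_task(model, robot, task_prompt, max_subtasks=20):
    """Schematic two-level loop. `predict_subtask` and `predict_action_chunk`
    are hypothetical names standing in for the high-level and low-level heads."""
    for _ in range(max_subtasks):
        obs = robot.observe()                              # images + proprioception
        subtask = model.predict_subtask(obs, task_prompt)  # e.g. "pick up the sponge"
        if subtask == "done":
            return True
        # The action expert conditions on the short subtask string rather than
        # the full task prompt, and emits a chunk of continuous actions.
        for action in model.predict_action_chunk(obs, subtask):
            robot.step(action)
    return False
```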

Figure 6.2 — The generalization gap, by lever. Bars compare training-distribution performance with held-out unseen-home performance for each ablated data slice; the cross-embodiment and subtask-language drops are large, while the web-VQA drop is essentially noise.

Where the eval breaks.

Outside the paper's own evaluation set, the picture is rougher. An independent in-the-wild evaluation by Penn GRASP reports zero percent success on novel glass surfaces and on unseen backgrounds — cases the model was not exposed to during training, where the visual statistics simply fall off the manifold. Prompt rewording — the same task described with different words — can swing success rates from zero to one hundred percent on the same scene; the policy is more sensitive to language phrasing than the marketing implies. And the LeRobot weight port hit zero percent out of the box until image preprocessing was fixed, suggesting the system is not yet plug-and-play reproducible outside the authors' stack.²

² Penn GRASP independent evaluation; LeRobot HF issue #2533 (preprocessing); see also Correll Lab review describing the eval as "a narrow slice of real-world manipulation."
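The LeRobot failure is worth dwelling on because the bug class is so mundane: a policy trained on images normalized one way silently receives images normalized another way, runs without error, and scores zero. A toy illustration of that class of bug, not the actual fix in the linked issue:

```python
import numpy as np

def preprocess_as_trained(img_uint8):
    # Assumed training-time convention for this illustration: pixels in [-1, 1].
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def preprocess_broken(img_uint8):
    # A plausible porting mistake: pixels left in [0, 255]. The policy still
    # runs end to end, but every input sits far off the training distribution,
    # so success silently collapses to zero.
    return img_uint8.astype(np.float32)

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
print(preprocess_as_trained(img).mean(), preprocess_broken(img).mean())  # ~0 vs ~127
```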

Three homes, evaluated by the team that built the model, with prompt sensitivity that swings results across the entire scoring range. The right summary is: π₀.5 generalizes to unseen-but-statistically-similar homes. "Open-world" overstates the eval.

What is still open.

The natural failure modes — novel object geometry, unfamiliar lighting, transparent or reflective surfaces — are not yet fixed by this recipe. Whether they get fixed by more data, by a different discrete target, or by a different inference loop is unsettled. The paper does not include a head-to-head with Gemini Robotics or other closed vision-language-action systems on shared benchmarks; that comparison is a conspicuous absence and the most natural follow-up question. And the LoRA-versus-Knowledge-Insulation question raised on the last page is an open empirical claim — KI works, but whether simpler parameter-efficient fine-tuning works just as well has not been tested in print.

π₀.5 is a calibration point. It demonstrates that a small architectural surface area — one transformer with two MLP banks, an action expert, a flow head, a stop-gradient, a discrete auxiliary target — combined with a careful training recipe and a heterogeneous data mix can produce non-trivial generalization in homes the model has never seen. The next versions, on this trajectory, will tell us how much further the recipe goes.

The five-slice mixture and what each does, in one paragraph each

Lab teleop: the legacy Pi multi-robot teleoperation fleet, single-embodiment in spirit. Foundational; not a novel lever for π₀.5.

Mobile-manipulator data, 104 homes: roughly 400 hours of in-domain demonstrations across 104 distinct homes, collected for π₀.5. The sample-efficient backbone for the flagship task; supportive in ablations.

Cross-embodiment: OXE-style demonstrations across multiple robot platforms. Removing it significantly degrades unseen-home performance. Active lever.

Web VQA / OCR / detection: internet image-text data inherited from PaliGemma pre-training. Its ablation is not statistically significant on the main tasks per π₀.5 §5, which is the most surprising finding in the paper.

Subtask language: roughly 11% of the high-level data, labelled with short natural-language strings. Removing it is the largest single drop in the ablation table. Active lever.
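To make the subtask-language entry concrete, here is a schematic of how such a label could enter training: every robot example supervises the action expert, but only the annotated ~11% also supplies a next-subtask text target for the backbone. Field names and structure are illustrative, not the paper's schema.

```python
def tokenize(text):
    # Stand-in for the real tokenizer.
    return text.split()

def training_targets(example):
    """Schematic only; field names are illustrative, not the paper's schema."""
    targets = {
        # Every robot example supervises the action expert (continuous targets).
        "action_chunk": example["action_chunk"],
    }
    subtask = example.get("subtask_label")
    if subtask is not None:
        # Only the ~11% of high-level data with a language annotation also
        # supervises next-subtask prediction, the target the hierarchical
        # planner relies on at deployment.
        targets["next_subtask_tokens"] = tokenize(subtask)
    return targets
```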

What's not a contribution of π₀.5

Architecture: π₀.5 keeps the same transformer family, the same flow-matching action expert, the same PaliGemma backbone size class. No new attention pattern, no new loss family, no new optimizer. The Knowledge Insulation paper that formalizes the recipe was published one month later as a separate piece; π₀.5's contribution is the data mixture and the curriculum.

Open-world capability: the eval is three homes. The marketing language outpaces the empirical scope. This is the single most important thing to keep straight when reading the paper.