Where you are. Page 2 left you with π₀ as a continuous-flow VLA. This page is the controlled detour through the alternative. FAST is best understood as a research arc, not a single result. In January 2025, Physical Intelligence released it as an autoregressive alternative to flow matching — a tokenizer plus an AR runtime head that would let a 3B VLM emit dexterous actions directly. By May 2025, PI had retired the runtime architecture: inference was roughly 2× slower than flow matching, and deployment pathologies stacked up. What stayed load-bearing is the tokenizer itself, now the standard pretraining auxiliary target inside Knowledge Insulation and π₀.5. [FAST paper, arXiv 2501.09747; KI paper, arXiv 2505.23705]

The central twist of this entire site is on this page: the tokenizer survived, the AR runtime did not. Hold that distinction; Pages 4 and 5 depend on it. We hand off the deep tokenizer mechanics to the companion FAST Tokenizer Workshop — that site has the full DCT → quantize → BPE pipeline with interactive math.

Sources: FAST paper (arXiv 2501.09747); KI paper (arXiv 2505.23705); fast_deep_dive_2026-04-14.md.

F2 — The FAST tokenizer in one paragraph

DCT along time → uniform scalar quantization → BPE merge. 350 floats become 30–60 tokens.

FAST is a four-stage pipeline. (1) Take an action chunk of shape [T=50, D=7] — 350 floats, one second of motion. (2) Run the Discrete Cosine Transform along the time axis (not the dimension axis), exploiting the fact that joint trajectories are smooth in time but joint indices are categorical. (3) Apply uniform scalar quantization to the DCT coefficients. (4) Run BPE on the resulting integer stream to merge frequent sub-sequences into ~1024–2048 motor primitives. [FAST §3]
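A minimal sketch of the four stages in numpy/scipy, to make the shapes concrete. The quantization scale and merge count are illustrative placeholders, not the paper's trained values; real FAST also normalizes actions to fixed quantile ranges before the DCT and fits the BPE vocabulary over a whole dataset rather than a single chunk.

```python
import numpy as np
from scipy.fft import dct

def tokenize_chunk(chunk: np.ndarray, scale: float = 10.0) -> list[int]:
    """chunk: [T=50, D=7] action chunk -> integer stream, pre-BPE."""
    # (2) DCT along the time axis (axis=0), never across joint indices.
    coeffs = dct(chunk, norm="ortho", axis=0)        # [50, 7], energy piles up in low rows
    # (3) Uniform scalar quantization: one shared step size for every coefficient.
    q = np.round(coeffs * scale).astype(int)         # high frequencies mostly round to 0
    # Flatten lowest-frequency-first so the high-energy values lead the stream.
    return q.flatten(order="C").tolist()

def bpe_merge(ids: list[int], num_merges: int = 1024) -> list[int]:
    """(4) Greedy BPE on one stream: repeatedly merge the most frequent adjacent pair."""
    next_id = max(ids) + 1
    for _ in range(num_merges):
        counts: dict[tuple[int, int], int] = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        (a, b), best = max(counts.items(), key=lambda kv: kv[1])
        if best < 2:
            break                                    # nothing left worth merging
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids, next_id = merged, next_id + 1
    return ids

chunk = np.cumsum(0.01 * np.random.randn(50, 7), axis=0)  # smooth fake trajectory
tokens = bpe_merge(tokenize_chunk(chunk))
print(f"{chunk.size} floats -> {len(tokens)} tokens")
```

Running BPE last is the point of the ordering: after quantization the stream is highly repetitive (long runs of zeros in the high frequencies), which is exactly the regime BPE compresses well.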

Two facts to take with you. First, the top 16 of 350 DCT coefficients hold roughly 95% of the trajectory's energy; that is why discretizing the coefficients, rather than the raw chunk values, avoids the “every adjacent token is identical” pathology: a smooth trajectory barely changes between control steps, so naive per-timestep binning emits long runs of duplicate tokens, while the DCT packs the same information into a few high-energy coefficients. Second, the JPEG analogy is the design, not a metaphor: DCT-along-time concentrates energy into a few low-frequency coefficients exactly the way DCT-on-image-blocks concentrates energy in JPEG.
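You can see the compaction in a few lines on a synthetic smooth trajectory. The exact fraction depends on the signal; the ~95% figure above is the paper's empirical number on real robot data.

```python
import numpy as np
from scipy.fft import dct

t = np.linspace(0, 1, 50)
traj = np.sin(2 * np.pi * t) + 0.3 * np.sin(6 * np.pi * t)   # smooth in time
c = dct(traj, norm="ortho")
frac = np.cumsum(np.sort(c**2)[::-1]) / np.sum(c**2)          # sorted cumulative energy
print(f"top-4 of 50 coefficients: {frac[3]:.1%} of the energy")  # well above 90% here
```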

The FAST Tokenizer Workshop covers all of this with interactive viz: the chunk, the naive-tokenization failure mode, the DCT-vs-DFT-vs-wavelet ablation, the end-to-end pipeline strip with byte counts. If you want the deep tokenizer story, read it after this page and come back. For our purposes here, the tokenizer is a black box that turns a chunk into 30–60 categorical tokens.

Sources: FAST paper §3 (DCT pipeline); fast_deep_dive_2026-04-14.md §1.

R1 — AR vs flow at the wall clock

A continuous flow head finishes one chunk in ~10 ODE steps. AR over FAST tokens needs ~30–60 sequential decodes.

Two horizontal progress strips. The top strip is π₀-FAST, the AR runtime version: a single chunk requires 30–60 token decodes through the 3B VLM, and each decode is sequential by construction (each token is conditioned on the sampled value of the one before it, so decoding cannot be parallelized). The bottom strip is π₀'s flow head: 10 Euler steps through a 300M action expert, all H=50 action tokens denoised in parallel within each step.

Two structural facts make this an architecture gap, not a tuning gap; the sketch after the quote below makes them concrete. (1) AR is sequential by construction. Token k+1 needs token k's value, so you cannot parallelize over the time axis. (2) Flow is parallel within a step. All H action tokens are denoised in one forward pass; only the 10 Euler steps are serial, and they're cheap because they share the cached prefix. The KI paper is blunt about it:

“An autoregressive VLA (π₀-FAST) requires twice the amount of time to solve the task… The robot running an autoregressive VLA only just starts to perform the task by the time the robot running our approach has already finished the instruction.” [KI paper / blog, May 2025]
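Here is a toy sketch of the two decode loops. `vlm_step` and `flow_expert` are hypothetical stand-ins for the 3B VLM forward pass and the 300M action expert; the toy definitions at the bottom exist only so the sketch runs.

```python
import numpy as np

def ar_decode(vlm_step, prefix, n_tokens=45):
    """π₀-FAST runtime: 30-60 sequential forward passes per chunk."""
    tokens = []
    for _ in range(n_tokens):
        nxt = vlm_step(prefix, tokens)    # token k+1 needs token k's sampled value,
        tokens.append(nxt)                # so this loop cannot be parallelized
    return tokens                         # still needs BPE-decode + inverse DCT

def flow_decode(flow_expert, prefix_cache, H=50, D=7, n_steps=10):
    """π₀ runtime: 10 Euler steps; all H action tokens updated in parallel."""
    a = np.random.randn(H, D)             # start the whole chunk from noise
    dt = 1.0 / n_steps
    for _ in range(n_steps):              # only these 10 steps are serial, and each
        v = flow_expert(prefix_cache, a)  # is one [H, D] pass against the cached prefix
        a = a + dt * v
    return a

vlm_step = lambda prefix, tokens: len(tokens) % 256   # toy stand-in
flow_expert = lambda cache, a: -a                     # toy velocity field
print(len(ar_decode(vlm_step, None)))                 # 45 sequential calls
print(flow_decode(flow_expert, None).shape)           # (50, 7) after 10 parallel steps
```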

There is also a deployment-level reason flow won. Real-Time Chunking (RTC, arXiv 2506.07339) treats the boundary between consecutive action chunks as a diffusion inpainting problem: freeze the tail of the previous chunk as known context, then use masked diffusion to inpaint a smooth start to the next chunk, as sketched below. This trick is only mathematically possible with continuous flow or diffusion heads; there is no “partially denoised chunk” inside an autoregressive token stream. LeRobot's RTC implementation (PR #1698, Nov 2025) supports π₀, π₀.5, and SmolVLA, and explicitly not π₀-FAST. [fast_deep_dive_2026-04-14.md §5]
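A minimal sketch of the inpainting idea under stated assumptions: the first k steps of the new chunk overlap the frozen tail of the previous one, and at every Euler step those positions are pinned to the flow interpolant between their own noise and the known actions. `flow_expert` is the same hypothetical stand-in as above; the real RTC algorithm uses a soft masking schedule rather than this hard clamp.

```python
import numpy as np

def rtc_decode(flow_expert, prefix_cache, frozen_tail, H=50, D=7, n_steps=10):
    """Inpaint a chunk whose first k steps must match known actions [k, D]."""
    k = frozen_tail.shape[0]
    a = np.random.randn(H, D)                  # whole chunk is noise at t=0
    noise_tail = a[:k].copy()
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        # Pin the overlap to the noise->data interpolant at the current t, so
        # the frozen region is "already solved" at every noise level and the
        # free region denoises into a smooth continuation of it.
        a[:k] = (1 - t) * noise_tail + t * frozen_tail
        v = flow_expert(prefix_cache, a)
        a = a + dt * v
    a[:k] = frozen_tail                        # exact known context at t=1
    return a

flow_expert = lambda cache, a: -a              # toy stand-in velocity field
tail = np.zeros((10, 7))                       # frozen tail of the previous chunk
chunk = rtc_decode(flow_expert, None, tail)
print(np.allclose(chunk[:10], tail))           # True: known context preserved
```

The sketch also makes the exclusion obvious: the clamp only exists because the whole chunk is a continuous array at every intermediate noise level, and an AR token stream has no such intermediate state to clamp.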

Sources: π₀ paper §D Table I; FAST paper §VI-D; KI paper / blog (May 2025); LeRobot RTC PR #1698.

What survived, what didn't