Pages 1 and 2 explained what FAST does and how. This page is why the workshop exists.
One-line thesis: FAST-the-architecture retired; FAST-the-tokenizer is load-bearing. π₀-FAST — the AR VLA that ships discrete tokens at runtime — is dormant inside PI. But the DCT + quantize + BPE tokenizer survived: as the pretraining objective in π₀.5, and as the auxiliary discrete target that makes Knowledge Insulation work. Every PI paper after May 2025 assumes a split: discrete tokens at training, continuous flow at runtime. RTC, π₀.5's co-training recipe, π₀.6 — all downstream of that split.
Most FAST talks stop at "JPEG for actions, 5× speedup, neat." This page is the differentiator: the honest evolution, the structural incompatibility with RTC, and the organizing principle FAST left behind. (fast_deep_dive_2026-04-14.md §5.)
Viz 8 — PI stack + FAST-role timeline
Where FAST sits inside PI's runtime architecture (training-time aux head, not at deployment), and how its role migrated from Oct 2024 through Dec 2025.
Viz 8 · PI stack + FAST-role timeline
The PI stack (Knowledge Insulation architecture)
Four boxes, two gradient paths. Solid blue arrows = forward pass. Red "||" = stop-gradient. Dashed yellow = FAST-token CE loss (training-time only, auxiliary head). Green dot = RTC inpainting attachment point on the continuous head. fast_deep_dive §4.1 / §4.3
Architecturally: FAST tokens enter only through the auxiliary head at training time. They do not exist at deployment. RTC inpainting plugs into the continuous action expert — which is the only head that produces a denoisable chunk. This is the structural reason RTC cannot rescue π₀-FAST. [fast_deep_dive §4.5]
FAST's role through PI's research timeline
Each node is a paper; the colored chip above it is FAST's role at that moment. The chronology shows FAST migrating from output head (Jan 2025) to aux target / pretraining tokenizer (Apr-May 2025) to not at runtime (Jun 2025 onward).
Chip legend: FAST = output head · FAST = aux target / pretraining · FAST = retired at runtime
FAST-the-tokenizer survived PI's research program and became one of its organizing principles. FAST-the-runtime-architecture did not. The right framing for FAST is the move that made discrete-tokens-at-training, continuous-flow-at-runtime a viable recipe. — fast_deep_dive §5
Architecture
Runtime stack (π₀.5 / KI / π₀.6): PaliGemma-class VLM backbone feeding a small flow-matching action expert (~10 Euler steps, ~100 ms/chunk). Two gradient paths touch the backbone:
Stop-gradient from the action expert. Expert reads features through stop_gradient; noisy flow-matching gradients never propagate into the backbone — the "insulation." (KI §3, arxiv 2505.23705.)
FAST-token CE loss as aux head. Cheap discrete head trained with cross-entropy on the FAST-tokenized chunk. Gradients do flow back — low-variance, categorical, action-aware. Same move as BERT MLM / MaskGIT; see the sketch below. (fast_deep_dive §4.3.)
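A minimal sketch of the two-path wiring. Module and batch-key names are hypothetical, not PI's actual code; the structure follows the KI description above:

```python
import torch
import torch.nn.functional as F

def ki_training_step(backbone, flow_expert, fast_head, batch):
    # VLM backbone encodes images + language prompt into features.
    feats = backbone(batch["images"], batch["prompt"])

    # Path 1: the flow-matching action expert reads features through a
    # stop-gradient (detach), so its noisy regression gradients never
    # reach the backbone -- the "insulation".
    flow_loss = flow_expert.flow_matching_loss(feats.detach(), batch["actions"])

    # Path 2: auxiliary discrete head. Cross-entropy on the FAST-tokenized
    # chunk DOES backprop into the backbone: low-variance, categorical,
    # action-aware signal.
    logits = fast_head(feats)  # (batch, seq_len, fast_vocab)
    ce_loss = F.cross_entropy(logits.flatten(0, 1), batch["fast_tokens"].flatten())

    return flow_loss + ce_loss
```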
π₀-FAST (Jan 2025, arxiv 2501.09747) — FAST as output head. AR VLA emits DCT+BPE tokens.
π₀.5 (Apr 2025, arxiv 2504.16054) — FAST as pretraining tokenizer. Pertsch (X): "Training with discrete FAST action tokenization now powers all of our pre-training in π-0.5!" Post-train uses a flow expert.
KI (May 2025, arxiv 2505.23705, NeurIPS Spotlight) — FAST as aux CE target, not at runtime. Per the synthesis in fast_deep_dive §4.3, removing the aux loss appears to drop OOD performance more than removing either the stop-gradient or the co-training data — an attribution from the multi-agent discussion synthesis, not from a KI paper table.
RTC (Jun 2025, arxiv 2506.07339) — builds on the continuous runtime. Incompatible with AR FAST (Viz 9).
Role across time: output head → aux target / pretraining tokenizer → not at runtime.
Viz 9 — AR vs flow + RTC incompatibility
Side-by-side runtime decode. The LeRobot RTC support matrix and the structural reason it rules out AR FAST.
Viz 9 · AR vs flow runtime, and RTC compatibility
How each runtime decodes one action chunk
AR decodes 60 tokens serially (~12 ms each → ~750 ms). Flow matching denoises the whole chunk in ~10 parallel ODE steps (~100 ms). Run both side-by-side to see the ~7.5× gap. [FAST §VI-D][KI arxiv 2505.23705]
Start both columns simultaneously: AR will still be decoding when flow is done.
Autoregressive (π₀-FAST)
60 sequential token boxes appear left-to-right, one per ~12 ms.
AR decodes serially: 30-60 tokens × ~12 ms/token through the 3B VLM. PI's own measurement: ~750 ms/chunk [FAST §VI-D]. FASTer measured 197-556 ms on LIBERO and 1100-3000 ms on R1Lite whole-body [FASTer arxiv 2512.04952].
Flow matching (π₀ / π₀.5 / SmolVLA)
Whole chunk denoised in 10 parallel ODE steps (~10 ms each).
Flow matching denoises the whole chunk in ~10 parallel ODE steps. PI's measurement: ~100 ms/chunk [KI arxiv 2505.23705][fast_deep_dive §4.2]. Rectified-flow-style near-straight trajectories converge in a few Euler steps.
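A back-of-envelope check on the ~7.5× figure, using the round numbers quoted above (assumes fixed per-token and per-step costs, which real decoders only approximate):

```python
# Serial AR decode vs parallel flow denoise, per action chunk.
ar_tokens, ms_per_token = 60, 12.5   # serial: latency grows with chunk length
flow_steps, ms_per_step = 10, 10.0   # parallel: fixed step count, whole chunk per step
ar_ms = ar_tokens * ms_per_token     # 750 ms/chunk (PI's measurement, FAST §VI-D)
flow_ms = flow_steps * ms_per_step   # 100 ms/chunk (KI, arxiv 2505.23705)
print(f"AR {ar_ms:.0f} ms vs flow {flow_ms:.0f} ms: {ar_ms / flow_ms:.1f}x gap")
```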
RTC compatibility
LeRobot RTC PR #1698 support matrix (merged Nov 2025).
✓ π₀ · RTC compatible
✓ π₀.5 · RTC compatible
✓ SmolVLA · RTC compatible
✗ π₀-FAST · RTC INCOMPATIBLE
Why? RTC inpainting is a denoising operation on continuous action values — it exploits the compositional property of score-based models, so you can condition on a frozen chunk prefix without retraining. AR token decoding has no "partially denoised chunk" to smooth between; its output is a sequence of categorical tokens followed by a non-differentiable decode back to continuous actions. [LeRobot PR #1698][fast_deep_dive §4.5]
RTC (arxiv 2506.07339) solves a deployment problem: if inference runs 100–300 ms behind real time, chunk transitions stitch jerkily. The trick: treat the boundary as a diffusion inpainting problem — freeze the previous chunk's tail as known context, masked flow/diffusion inpaints a smooth start to the next. No retraining.
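To make the mechanics concrete: a minimal repaint-style sketch of flow inpainting with a frozen chunk prefix. velocity_fn and the hard re-clamping are illustrative assumptions; the actual RTC (LeRobot PR #1698) uses a softer guided scheme, but the structural requirement is identical: a continuous, partially-denoised chunk to operate on.

```python
import torch

def inpaint_chunk(velocity_fn, frozen_tail, horizon, act_dim, n_steps=10):
    """Denoise a new chunk while holding the previous chunk's tail fixed.
    Sketch under the rectified-flow convention x_t = (1 - t) * noise + t * data."""
    prefix_len = frozen_tail.shape[0]
    noise = torch.randn(horizon, act_dim)
    x, dt = noise.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        # Keep the known region on its interpolation path toward the frozen
        # tail, so the model denoises a suffix consistent with that prefix.
        x[:prefix_len] = (1 - t) * noise[:prefix_len] + t * frozen_tail
        x = x + dt * velocity_fn(x, t)  # parallel Euler step over the whole chunk
    x[:prefix_len] = frozen_tail        # prefix exact at t = 1
    return x
```

There is no AR analogue of the loop body: a token decoder has no intermediate x to clamp and re-denoise.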
Structurally impossible on an AR FAST head (fast_deep_dive §4.5):
RTC inpainting is denoising on continuous values. Exploits the compositional property of score-based models — ∇log p(x | c1, c2) decomposes into a sum of conditional scores, written out after this list — so prefixes condition without retraining.
AR decoding has no such composition. You can prefix-condition an AR model, but you cannot inpaint a continuous trajectory through it. Output is categorical tokens followed by a non-differentiable decode. No partially-denoised chunk to smooth between.
The pathology is documented. STABLE-FAST (OpenReview 0XQSa9CjR7): FAST produces "high variance in inference time and significant physical action instability, causing pauses and jerky motions." BPE yields variable-length token streams for fixed-length chunks — exactly what RTC would smooth, if RTC could touch AR.
Confirmed in code. LeRobot's RTC (PR #1698, merged Nov 2025) supports π₀, π₀.5, SmolVLA — all flow-matching. π₀-FAST explicitly unsupported.
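The decomposition named in the first bullet, written out. Assuming the two conditions are independent given x, Bayes gives p(x | c1, c2) ∝ p(x) p(c1 | x) p(c2 | x), so:

```latex
\nabla_x \log p(x \mid c_1, c_2)
  = \nabla_x \log p(x)
  + \nabla_x \log p(c_1 \mid x)
  + \nabla_x \log p(c_2 \mid x)
```

A frozen chunk prefix is just one more additive conditioning term on the score, which is why no retraining is needed; an AR factorization over categorical tokens has no analogous additive structure over continuous values.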
The deployment-level reason PI runs continuous flow at runtime goes beyond inference speed: it determines whether the rest of the stack (smoothing, latency compensation, chunk stitching) exists for you at all. Discrete-at-train + continuous-at-runtime isn't a compromise; it's the only configuration where the full PI toolkit is available.
Honest-Framing Rails — Workshop Cheat Sheet
Claims to state confidently
DCT + uniform quantization + BPE is a coherent, effective pipeline for converting smooth action chunks into VLM-compatible tokens (FAST §3).
FAST made autoregressive VLAs competitive with flow-matching VLAs on dexterous tasks at training time — the first tokenizer to do so (FAST §5; π₀-FAST matched π₀ on laundry / table bussing / grocery bagging).
FAST tokens have become the standard auxiliary pretraining target in PI's current architecture (π₀.5 pretraining; KI aux head; inherited by π₀.6). Pertsch confirmed this on X, Apr 2025.
Per the synthesis in fast_deep_dive §4.3, removing the FAST-token auxiliary loss appears to drop π₀.5's OOD performance more than removing either the stop-gradient or the co-training data — though this attribution comes from the multi-agent discussion synthesis, not directly a KI paper ablation table [fast_deep_dive §4.3].
The JPEG analogy is not a metaphor — it is the literal design. DCT + scalar quantization + entropy coding, applied to a time axis instead of a spatial axis (FAST §3); see the sketch after this list.
Structural incompatibility between AR token decoding and RTC inpainting is real, and PI's stack is arranged around it (LeRobot PR #1698 support matrix).
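That pipeline is small enough to sketch end-to-end. A hedged rendering of the encode path: scale is an illustrative quantization knob rather than the paper's setting, and the BPE stage (a standard byte-pair encoder trained on these integer streams) is omitted.

```python
import numpy as np
from scipy.fft import dct

def fast_encode(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """FAST-style encode of one action chunk of shape (T, D):
    DCT over the time axis, uniform quantization, low-frequency-first flatten."""
    coeffs = dct(chunk, axis=0, norm="ortho")          # energy compacts into low freqs
    quantized = np.round(coeffs * scale).astype(int)   # coarse uniform quantization
    # Row-major flatten interleaves dimensions per frequency, putting the most
    # informative (low-frequency) coefficients first for the BPE pass.
    return quantized.flatten()
```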
Claims to soften
"5× speedup" → "5× training speedup." Inference is ~2× slower than flow-matched π₀. KI paper (arxiv 2505.23705): "An autoregressive VLA (π0-FAST) requires twice the amount of time to solve the task."
"10× compression" → "10× compression, with ~48% codebook utilization." FASTer (arxiv 2512.04952) shows a single dominant token accounts for ~10% of all activations.
"Universal cross-embodiment tokenizer" → "Cross-embodiment without retraining, but it is debated whether fixed geometric priors limit transfer quality." (ActionCodec critique.)
"FAST fundamentally solves action tokenization" → "FAST opened the problem; FASTer, VQ-VLA, ActionCodec all beat it on reconstruction, codebook usage, or inference speed."
Claims to drop
Any suggestion that DCT specifically is proven optimal. The paper explicitly states (§VI-D): "alternatives [to DCT] could be used instead, but we leave this investigation for future work." The BPE ablation shows most of the win is discretization + frequency-domain compression in general; DCT-vs-wavelets/Haar/learned was never tested.
Any suggestion that FAST is a good runtime architecture choice going forward. PI retired it. LeRobot deprecated Pi0FAST after community reproductions hit 3% LIBERO (HF issue #955).
LIBERO numbers from community ports quoted without marking them as non-PI-stack: quote PI's own numbers and note ports separately.
Any framing that positions FAST as replacing flow matching. That framing died in May 2025 with KI.
"FAST opened the door, then the field walked through it."
FAQ
Q: Why not just downsample the action chunk?
A: Downsampling throws energy away uniformly in time. DCT throws it away by frequency — which is the correct basis for smooth signals. Action chunks have most of their energy in the low-frequency DCT coefficients; discarding high-frequency ones (via coarse quantization) loses noise, not signal. Uniform time-domain downsampling loses structure indiscriminately. (pi_workshop_study_guide §2.1 FAQs; fast_deep_dive §1.)
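A ten-line illustration of the energy-compaction argument, on a synthetic smooth signal rather than a real trajectory:

```python
import numpy as np
from scipy.fft import dct

t = np.linspace(0, 1, 50)
chunk = 0.5 * np.sin(2 * np.pi * t) + 0.1 * t      # smooth, low-frequency "action"
coeffs = dct(chunk, norm="ortho")
energy = np.cumsum(coeffs ** 2) / np.sum(coeffs ** 2)
print(f"energy in first 5 of 50 DCT coefficients: {energy[4]:.1%}")
# Uniform downsampling to 5 timesteps keeps 5 samples chosen blindly in time;
# keeping the 5 leading DCT coefficients keeps nearly all of the signal.
```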
Q: Why not a learned tokenizer like VQ-VAE?
A: FAST is universal and non-learned — trained once on 1M trajectories, reused across embodiments as a deterministic preprocessor. VQ-VAE codes are dataset-specific and require retraining when the embodiment changes, which breaks the "one backbone, many robots" story. The relevant comparison is VQ-VLA (arxiv 2507.01016), which does scale better with data but pays the retraining cost. FAST trades quality for portability. (pi_workshop_study_guide §2.1; fast_deep_dive §1.)
Q: Has FAST held up?
A: In training pipelines, yes. At runtime, no. PI themselves moved FAST out of the runtime path in May 2025. The KI paper (arxiv 2505.23705) states plainly: "An autoregressive VLA (π0-FAST) requires twice the amount of time to solve the task." LeRobot deprecated Pi0FAST after community reproductions hit 3% LIBERO (HF issue #955). The tokenizer survived as an auxiliary pretraining target in KI and π₀.5; the runtime architecture did not. (fast_deep_dive §§3, 5.)
Q: Why can't I deploy π₀-FAST on my UR20 with RTC?
A: RTC smooths chunk boundaries by inpainting a continuous action trajectory — masking the tail/head and denoising through a score-based model. AR token decoding produces a categorical token stream followed by a non-differentiable decode; there is no "partially denoised chunk" to smooth between. LeRobot's RTC (PR #1698, merged Nov 2025) supports π₀, π₀.5, and SmolVLA — not π₀-FAST, by construction. You would also hit STABLE-FAST's variable-length-token pathology: "high variance in inference time and significant physical action instability, causing pauses and jerky motions." (fast_deep_dive §4.5.)
Q: What survived from FAST?
A: FAST tokens as an auxiliary discrete prediction target. In KI, a stop-gradient insulates the VLM backbone from the flow-matching action expert's noisy gradients; a CE loss predicting FAST tokens flows into the backbone and keeps it action-aware with low-variance signal. In π₀.5, FAST-tokenized action prediction is the main pretraining objective across the heterogeneous data mix (Pertsch, X, Apr 2025: "Training with discrete FAST action tokenization now powers all of our pre-training in π-0.5!"). FAST+ (1M trajectories) remains the de facto universal action vocabulary inside PI. (fast_deep_dive §§4.3, 4.4, 5.)
Q: Is DCT specifically the right choice?
A: The paper never ablated it. FAST §VI-D is explicit: "alternatives [to DCT] could be used instead, but we leave this investigation for future work." The BPE ablation shows that even without BPE the pipeline still outperforms naive tokenization — meaning the load-bearing ingredient is discretization + frequency-domain energy compaction, with DCT chosen for simplicity and well-understood behavior rather than proven optimality. Wavelets, Haar transforms, or learned transforms could plausibly match or beat it; the experiment has not been run. (fast_deep_dive §3; pi_workshop_study_guide §1.1.)
Q: What's the 48% codebook utilization about?
A: FASTer (arxiv 2512.04952) reports FAST uses only 48% of its 2,048-token vocabulary, with a single dominant token accounting for ~10% of all activations. FASTer's learned tokenizer uses 100% of a 4,096-token vocabulary. FASTer also notes FAST exhibits "significant shortcomings in reconstruction accuracy — even on in-distribution datasets." The interpretation: BPE's greedy frequency merging is a mediocre match for the quantized-DCT distribution, and much of the apparent compression win is DCT doing the work while BPE largely run-length-encodes zeros. Quote the 10× compression claim with the 48% caveat attached. (fast_deep_dive §3; pi_workshop_study_guide §1.1.)
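For the workshop, these statistics are easy to measure on any token stream. A sketch of the computation (the figures above are FASTer's measurements, not this snippet's output):

```python
from collections import Counter

def codebook_stats(token_stream, vocab_size):
    """Utilization = fraction of the vocabulary ever emitted;
    dominant share = the single most frequent token's fraction of all tokens."""
    counts = Counter(token_stream)
    utilization = len(counts) / vocab_size
    dominant_share = max(counts.values()) / len(token_stream)
    return utilization, dominant_share

# e.g. FASTer reports roughly (0.48, ~0.10) for FAST's 2,048-token vocabulary.
```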