Pages 1 and 2 explained what FAST does and how. This page is why the workshop exists.

One-line thesis: FAST-the-architecture retired; FAST-the-tokenizer is load-bearing. π₀-FAST — the AR VLA that ships discrete tokens at runtime — is dormant inside PI. But the DCT + quantize + BPE tokenizer survived: as the pretraining objective in π₀.5, and as the auxiliary discrete target that makes Knowledge Insulation work. Every PI paper after May 2025 assumes a split: discrete tokens at training, continuous flow at runtime. RTC, π₀.5's co-training recipe, π*₀.6 — all downstream of that split.

Most FAST talks stop at "JPEG for actions, 5× speedup, neat." This page is the differentiator: the honest evolution, the structural incompatibility with RTC, and the organizing principle FAST left behind. (fast_deep_dive_2026-04-14.md §5.)

Viz 8 — PI stack + FAST-role timeline

Where FAST sits inside PI's runtime architecture (training-time aux head, not at deployment), and how its role migrated from Oct 2024 through Dec 2025.


The PI stack (Knowledge Insulation architecture)

Four boxes, two gradient paths. Solid blue arrows = forward pass. Red "||" = stop-gradient. Dashed yellow = FAST-token CE loss (training-time only, auxiliary head). Green dot = RTC inpainting attachment point on the continuous head. (fast_deep_dive §4.1 / §4.3)

  • VLM backbone: PaliGemma, 3B params; "frozen-ish," gradient-protected via stop-grad. [KI arxiv 2505.23705]
  • Action expert: flow matching, ~300M params; continuous velocity field; runtime inference ~10 Euler steps per chunk.
  • Real robot: 50 Hz continuous action stream.
  • Aux head: FAST-token CE loss, training-time only; the discrete CE loss gives the VLM backbone a low-variance learning signal.
  • RTC inpainting attaches to the continuous head only. [LeRobot PR #1698]
Architecturally: FAST tokens enter only through the auxiliary head at training time. They do not exist at deployment. RTC inpainting plugs into the continuous action expert — which is the only head that produces a denoisable chunk. This is the structural reason RTC cannot rescue π₀-FAST. [fast_deep_dive §4.5]

FAST's role through PI's research timeline

Each node is a paper; the colored chip above it is FAST's role at that moment. The chronology shows FAST migrating from output head (Jan 2025) to aux target / pretraining tokenizer (Apr-May 2025) to not at runtime (Jun 2025 onward).
Role chips: FAST = output head · FAST = aux target / pretraining · FAST = retired at runtime.

  • Oct 2024, π₀ (arxiv 2410.24164): continuous flow matching. FAST not yet.
  • Jan 2025, π₀-FAST (arxiv 2501.09747): AR on FAST tokens. FAST = output head.
  • Apr 2025, π₀.5 (arxiv 2504.16054): FAST tokens for pretraining. FAST = aux target.
  • May 2025, KI / Knowledge Insulation (arxiv 2505.23705): CE loss + stop-grad. FAST = aux target.
  • Jun 2025, RTC (arxiv 2506.07339): real-time chunking, flow-only. FAST = retired at runtime.
  • Dec 2025, π*₀.6 · FASTer · TT-RTC (arxiv 2512.04952): flow matching at runtime. FAST = retired at runtime.
FAST-the-tokenizer survived PI's research program and became one of its organizing principles. FAST-the-runtime-architecture did not. The right framing for FAST is the move that made discrete-tokens-at-training, continuous-flow-at-runtime a viable recipe. — fast_deep_dive §5

Architecture

Runtime stack (π₀.5 / KI / π*₀.6): PaliGemma-class VLM backbone feeding a small flow-matching action expert (~10 Euler steps, ~100 ms/chunk). Two gradient paths touch the backbone (a minimal sketch of both follows the list):

  1. Stop-gradient from the action expert. Expert reads features through stop_gradient; noisy flow-matching gradients never propagate into the backbone — the "insulation." (KI §3, arxiv 2505.23705.)
  2. FAST-token CE loss as aux head. Cheap discrete head trained with cross-entropy on the FAST-tokenized chunk. Gradients do flow back — low-variance, categorical, action-aware. Same move as BERT MLM / MaskGIT. (fast_deep_dive §4.3.)
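
To make the two paths concrete, here is a minimal PyTorch-style sketch of one KI training step. It illustrates the pattern, not PI's code: `backbone`, `flow_expert`, and `aux_head` are hypothetical modules, and the rectified-flow interpolant is one standard choice of flow-matching target.

```python
import torch
import torch.nn.functional as F

def ki_losses(backbone, flow_expert, aux_head, obs, actions, fast_tokens):
    """One KI training step: insulated flow loss + gradient-carrying aux CE loss."""
    feats = backbone(obs)                        # VLM features

    # Path 1: flow-matching loss for the action expert.
    # detach() is the stop-gradient: the expert's noisy regression
    # gradients never reach the backbone (the "insulation").
    t = torch.rand(actions.shape[0], 1, 1)       # flow time in [0, 1]
    noise = torch.randn_like(actions)            # actions: (B, horizon, dim)
    x_t = (1 - t) * noise + t * actions          # rectified-flow interpolant
    pred_v = flow_expert(feats.detach(), x_t, t)
    flow_loss = F.mse_loss(pred_v, actions - noise)

    # Path 2: auxiliary FAST-token CE loss.
    # No detach here: low-variance categorical gradients DO flow into
    # the backbone and keep it action-aware.
    logits = aux_head(feats)                     # (B, seq_len, vocab)
    ce_loss = F.cross_entropy(logits.transpose(1, 2), fast_tokens)

    return flow_loss + ce_loss                   # aux head is dropped at deployment
```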

Timeline

  • π₀ (Oct 2024, arxiv 2410.24164) — continuous flow matching; no FAST.
  • π₀-FAST (Jan 2025, arxiv 2501.09747) — FAST as output head. AR VLA emits DCT+BPE tokens.
  • π₀.5 (Apr 2025, arxiv 2504.16054) — FAST as pretraining tokenizer. Pertsch (X): "Training with discrete FAST action tokenization now powers all of our pre-training in π-0.5!" Post-train uses a flow expert.
  • KI (May 2025, arxiv 2505.23705, NeurIPS Spotlight) — FAST as aux CE target. Not at runtime. Per the synthesis in fast_deep_dive §4.3, removing the aux loss appears to drop OOD performance more than removing either the stop-gradient or the co-training data — an attribution from the multi-agent discussion synthesis, not from a KI paper table.
  • RTC (Jun 2025, arxiv 2506.07339) — builds on the continuous runtime. Incompatible with AR FAST (Viz 9).
  • π*₀.6 / Recap / RLT (Dec 2025) — inherits KI; FAST remains a pretraining/aux signal; flow matching at runtime.

Role across time: output head → aux target / pretraining tokenizer → not at runtime.

Viz 9 — AR vs flow + RTC incompatibility

Side-by-side runtime decode. The LeRobot RTC support matrix and the structural reason it rules out AR FAST.


How each runtime decodes one action chunk

AR decodes 60 tokens serially (~12 ms each → ~750 ms). Flow matching denoises the whole chunk in ~10 parallel ODE steps (~100 ms). Run both side-by-side to see the ~7.5× gap. [FAST §VI-D] [KI arxiv 2505.23705]

Autoregressive (π₀-FAST)

60 sequential token boxes appear left-to-right, one per ~12 ms.
AR decodes serially. 30-60 tokens × ~12 ms/token through the 3B VLM. PI's own measurement: ~750 ms/chunk [FAST §VI-D]. FASTer measured 197-556 ms on LIBERO; 1100-3000 ms on R1Lite whole-body [FASTer arxiv 2512.04952].

Flow matching (π₀ / π₀.5 / SmolVLA)

Whole chunk denoised in 10 parallel ODE steps (~10 ms each).
Flow matching denoises the whole chunk in ~10 parallel ODE steps. PI's measurement: ~100 ms/chunk [KI arxiv 2505.23705] [fast_deep_dive §4.2]. Rectified-flow-style near-straight trajectories converge in a few Euler steps.
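
The gap is back-of-envelope arithmetic. A tiny sketch, assuming round numbers (12.5 ms/token so the product matches PI's ~750 ms measurement; real latencies vary with hardware and chunk length):

```python
# Round numbers: ~12.5 ms/token so 60 tokens reproduce PI's ~750 ms figure.
ar_tokens, ms_per_token = 60, 12.5     # serial: cost grows with chunk length
flow_steps, ms_per_step = 10, 10.0     # parallel over the chunk: cost is
                                       # fixed by the number of ODE steps
ar_ms = ar_tokens * ms_per_token       # 750 ms
flow_ms = flow_steps * ms_per_step     # 100 ms
print(f"AR {ar_ms:.0f} ms vs flow {flow_ms:.0f} ms: {ar_ms / flow_ms:.1f}x gap")
# At 50 Hz control, 750 ms is ~37 missed ticks per chunk; 100 ms is ~5.
```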

RTC compatibility

LeRobot RTC PR #1698 support matrix (merged Nov 2025).
  • π₀: RTC compatible
  • π₀.5: RTC compatible
  • SmolVLA: RTC compatible
  • π₀-FAST: RTC INCOMPATIBLE
Why? RTC inpainting is a denoising operation on continuous action values — it exploits the compositional property of score-based models, so you can condition on a frozen chunk prefix without retraining. AR token decoding has no "partially denoised chunk" to smooth between; its output is a sequence of categorical tokens followed by a non-differentiable decode back to continuous actions. [LeRobot PR #1698] [fast_deep_dive §4.5]

RTC (arxiv 2506.07339) solves a deployment problem: if inference runs 100–300 ms behind real time, chunk transitions stitch jerkily. The trick: treat the boundary as a diffusion inpainting problem — freeze the previous chunk's tail as known context and let masked flow/diffusion inpaint a smooth start to the next. No retraining.

Structurally impossible on an AR FAST head (fast_deep_dive §4.5):

  • RTC inpainting is denoising on continuous values. It exploits the compositional property of score-based models — ∇ₓ log p(x | c₁, c₂) decomposes into a sum of conditional scores — so prefixes condition the sample without retraining (sketched after this list).
  • AR decoding has no such composition. You can prefix-condition an AR model, but you cannot inpaint a continuous trajectory through it. The output is categorical tokens followed by a non-differentiable decode. There is no partially-denoised chunk to smooth between.
  • The pathology is documented. STABLE-FAST (OpenReview 0XQSa9CjR7): FAST produces "high variance in inference time and significant physical action instability, causing pauses and jerky motions." BPE yields variable-length token streams for fixed-length chunks — exactly what RTC would smooth, if RTC could touch AR.
  • Confirmed in code. LeRobot's RTC (PR #1698, merged Nov 2025) supports π₀, π₀.5, SmolVLA — all flow-matching. π₀-FAST explicitly unsupported.
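
Concretely: if the contexts are conditionally independent given the chunk, p(x | c₁, c₂) ∝ p(x) p(c₁ | x) p(c₂ | x), so ∇ₓ log p(x | c₁, c₂) = ∇ₓ log p(x) + ∇ₓ log p(c₁ | x) + ∇ₓ log p(c₂ | x): the frozen prefix enters as one more additive score term. The sketch below shows the simplest hard-mask variant of the idea, overwriting the prefix after every denoising step. RTC proper uses a soft-masked guidance form, and `flow_expert` here is a hypothetical module, not LeRobot's API.

```python
import torch

@torch.no_grad()
def inpaint_next_chunk(flow_expert, feats, frozen_prefix, horizon, dim, steps=10):
    """Hard-mask inpainting at a chunk boundary (hypothetical flow_expert).

    frozen_prefix: (B, k, dim) actions already committed from the previous
    chunk's tail. The model denoises the whole next chunk; after each Euler
    step the prefix is clamped back to its known values (re-noised to the
    current flow time), so the generated suffix stays consistent with it.
    """
    B, k, _ = frozen_prefix.shape
    x = torch.randn(B, horizon, dim)              # start the chunk from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, 1, 1), i * dt)
        x = x + dt * flow_expert(feats, x, t)     # one Euler step, whole chunk
        t_next = (i + 1) * dt                     # clamp prefix at its known value,
        noise = torch.randn(B, k, dim)            # re-noised to the new flow time
        x[:, :k] = (1 - t_next) * noise + t_next * frozen_prefix
    return x

# No analogue exists for an AR FAST head: there is no intermediate x to
# clamp. AR emits categorical tokens, then a non-differentiable DCT/BPE
# decode maps them to actions; a "partially denoised chunk" never exists.
```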

The deployment-level reason PI runs continuous flow at runtime goes beyond inference speed: it determines whether the rest of the stack (smoothing, latency compensation, chunk stitching) exists for you at all. Discrete-at-train + continuous-at-runtime isn't a compromise; it's the only configuration where the full PI toolkit is available.

Honest-Framing Rails — Workshop Cheat Sheet

Three buckets: claims to state confidently · claims to soften · claims to drop.

"FAST opened the door, then the field walked through it."

FAQ

Q: Why not just downsample the action chunk?

A: Downsampling throws energy away uniformly in time. DCT throws it away by frequency — which is the correct basis for smooth signals. Action chunks have most of their energy in the low-frequency DCT coefficients; discarding high-frequency ones (via coarse quantization) loses noise, not signal. Uniform time-domain downsampling loses structure indiscriminately. (pi_workshop_study_guide §2.1 FAQs; fast_deep_dive §1.)
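
A toy comparison makes this concrete. The sketch below (NumPy/SciPy; the trajectory and the budget of 8 values are arbitrary illustration choices, not FAST's actual settings) keeps 8 DCT coefficients versus 8 time-domain samples of the same smooth trajectory and compares reconstruction error.

```python
import numpy as np
from scipy.fft import dct, idct

# Toy smooth 1-DoF "action" trajectory over a 50-step chunk.
t = np.linspace(0, 1, 50)
traj = 0.8 * np.sin(2 * np.pi * t) + 0.1 * t

# Frequency route: keep the 8 lowest-frequency DCT coefficients.
coeffs = dct(traj, norm="ortho")
kept = np.zeros_like(coeffs)
kept[:8] = coeffs[:8]
dct_recon = idct(kept, norm="ortho")

# Time route: keep 8 evenly spaced samples, linearly interpolate back.
idx = np.arange(0, 50, 7)                        # 8 samples: 0, 7, ..., 49
ds_recon = np.interp(np.arange(50), idx, traj[idx])

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
print("8 DCT coeffs RMSE:", rmse(traj, dct_recon))
print("8 samples    RMSE:", rmse(traj, ds_recon))
# Same budget of 8 numbers: the DCT spends them where a smooth signal's
# energy lives; downsampling discards samples blind to frequency content.
```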

Q: Why not a learned tokenizer like VQ-VAE?

A: FAST is universal and non-neural — its quantization and BPE vocabulary are fitted once on 1M trajectories, then reused across embodiments as a deterministic preprocessor. VQ-VAE codes are dataset-specific and require retraining when the embodiment changes, which breaks the "one backbone, many robots" story. The relevant comparison is VQ-VLA (arxiv 2507.01016), which does scale better with data but pays the retraining cost. FAST trades quality for portability. (pi_workshop_study_guide §2.1; fast_deep_dive §1.)

Q: Has FAST held up?

A: In training pipelines, yes. At runtime, no. PI themselves moved FAST out of the runtime path in May 2025. The KI paper (arxiv 2505.23705) states plainly: "An autoregressive VLA (π0-FAST) requires twice the amount of time to solve the task." LeRobot deprecated Pi0FAST after community reproductions hit 3% LIBERO (HF issue #955). The tokenizer survived as an auxiliary pretraining target in KI and π₀.5; the runtime architecture did not. (fast_deep_dive §§3, 5.)

Q: Why can't I deploy π₀-FAST on my UR20 with RTC?

A: RTC smooths chunk boundaries by inpainting a continuous action trajectory — masking the tail/head and denoising through a score-based model. AR token decoding produces a categorical token stream followed by a non-differentiable decode; there is no "partially denoised chunk" to smooth between. LeRobot's RTC (PR #1698, merged Nov 2025) supports π₀, π₀.5, and SmolVLA — not π₀-FAST, by construction. You would also hit STABLE-FAST's variable-length-token pathology: "high variance in inference time and significant physical action instability, causing pauses and jerky motions." (fast_deep_dive §4.5.)

Q: What survived from FAST?

A: FAST tokens as an auxiliary discrete prediction target. In KI, a stop-gradient insulates the VLM backbone from the flow-matching action expert's noisy gradients; a CE loss predicting FAST tokens flows into the backbone and keeps it action-aware with low-variance signal. In π₀.5, FAST-tokenized action prediction is the main pretraining objective across the heterogeneous data mix (Pertsch, X, Apr 2025: "Training with discrete FAST action tokenization now powers all of our pre-training in π-0.5!"). FAST+ (1M trajectories) remains the de facto universal action vocabulary inside PI. (fast_deep_dive §§4.3, 4.4, 5.)

Q: Is DCT specifically the right choice?

A: The paper never ablated it. FAST §VI-D is explicit: "alternatives [to DCT] could be used instead, but we leave this investigation for future work." The BPE ablation shows that even without BPE the pipeline still outperforms naive tokenization — meaning the load-bearing ingredient is discretization + frequency-domain energy compaction, with DCT chosen for simplicity and well-understood behavior rather than proven optimality. Wavelets, Haar transforms, or learned transforms could plausibly match or beat it; the experiment has not been run. (fast_deep_dive §3; pi_workshop_study_guide §1.1.)

Q: What's the 48% codebook utilization about?

A: FASTer (arxiv 2512.04952) reports FAST uses only 48% of its 2,048-token vocabulary, with a single dominant token accounting for ~10% of all activations. FASTer's learned tokenizer uses 100% of a 4,096-token vocabulary. FASTer also notes FAST exhibits "significant shortcomings in reconstruction accuracy — even on in-distribution datasets." The interpretation: BPE's greedy frequency merging is a mediocre match for the quantized-DCT distribution, and much of the apparent compression win is DCT doing the work while BPE largely run-length-encodes zeros. Quote the 10× compression claim with the 48% caveat attached. (fast_deep_dive §3; pi_workshop_study_guide §1.1.)
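
If you want to check this on your own corpus, the diagnostic is a few lines of NumPy. A hypothetical sketch: `token_ids` is whatever your FAST tokenizer emits over a dataset; the function is illustrative, not from any PI codebase.

```python
import numpy as np

def vocab_stats(token_ids, vocab_size):
    """Utilization and dominance diagnostics for a tokenized action corpus.

    token_ids: flat int array of every token emitted over a dataset.
    Returns the two numbers FASTer reports for FAST: fraction of the
    vocabulary ever used (~0.48 for FAST's 2,048-token vocab) and the
    share of the single most frequent token (~0.10).
    """
    counts = np.bincount(token_ids, minlength=vocab_size)
    return (counts > 0).mean(), counts.max() / counts.sum()

# util, top = vocab_stats(np.concatenate(chunk_token_ids), vocab_size=2048)
```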