Before tackling the FAST pipeline, we need to be clear about three things: what an action chunk actually is, why the obvious way to tokenize it does not work, and why the transform axis matters. These three vizzes are the priors you need to have loaded before the math on Page 2 does any work.
Sources: FAST §2, §3, §4 (arXiv 2501.09747); fast_deep_dive_2026-04-14.md §1.
Viz 1 — The chunk itself
Anchor the input before touching the tokenizer. This is what FAST compresses: 350 floats, one second of motion.
1 chunk = 50 timesteps × 7 joint dimensions = 350 floats. At 50 Hz control, this is 1 second of motion.
[FAST §3] [fast_deep_dive_2026-04-14.md §1]
[Viz 1 panel — heatmap of the chunk. y: joint dim (0-6), x: timestep (0 → 49, 50 Hz), color: joint value (rad).]
An action chunk is the unit of work a VLA emits. Concretely: a matrix A of shape T × D, where T = 50 timesteps and D = 7 dimensions for a typical single-arm setup (6 joints plus gripper). At a 50 Hz control rate this is exactly one second of motion. Each column is the trajectory of one joint; each row is the full robot pose at one instant. That is the object FAST compresses. The workshop's baseline reference — a diffusion policy on UR20+Robotiq+RealSense — consumes the same shape; the only difference downstream is whether the chunk is emitted as a continuous flow output or routed through a tokenizer first. Keep the shape in mind. Every number in the rest of the pipeline is derived from T × D = 350 floats.
Source: FAST §3 (arXiv 2501.09747).
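For concreteness, here is the object as an array. A minimal sketch: the shape comes from FAST §3, but the zero values are placeholders, not a recorded trajectory.

```python
import numpy as np

# The action chunk of FAST §3 as an array: T timesteps × D joint dims.
# Placeholder values; a real chunk holds joint targets in radians.
T, D = 50, 7            # 50 Hz × 1 s; 6 arm joints + 1 gripper dim
A = np.zeros((T, D))

column = A[:, 3]        # the 1-second trajectory of joint 3
row = A[10, :]          # the full robot pose 200 ms into the chunk
assert A.size == 350    # every later number derives from these floats
```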
Viz 2 — Naive tokenization fails
Motivate why "tokenize each dim-timestep directly" is a dead end. The "copy the previous token" pathology made visible.
[Viz 2 panels — top: original chunk (same colormap as Viz 1); bottom: naive tokens as integer bin IDs (256 bins); x: timestep (0 → 49); yellow bars mark (t, t+1) pairs where token IDs are identical.]
“Highly correlated action tokens diminish the effectiveness of the next token prediction objective… low token prediction loss can often be achieved with trivial mappings like simply copying the most recent action token.”
— FAST §2
Per-cell uniform binning: each (t, d) scalar mapped independently to a bin index. Adjacent cells in time are nearly equal in value, so they end up in the same bin — next-token prediction reduces to copying the previous token.
[FAST §2, §4] [fast_deep_dive_2026-04-14.md §1]
Before FAST, autoregressive VLAs (RT-2, OpenVLA) used per-dimension, per-timestep uniform binning: every one of the T × D = 350 scalars became one token from a ~256-bin vocabulary. That looks innocent until you look at the loss curve. Adjacent action values are nearly identical, so next-token prediction collapses to pattern-matching the previous token. The paper states this directly:
"low token prediction loss can often be achieved with trivial mappings like simply copying the most recent action token." (FAST §2)
The model posts a clean loss curve without learning any action structure. The FAST ablation is blunter: standard discretization "could not solve any of the dexterous tasks" in PI's benchmark. The problem is not vocabulary size or encoder capacity; it is that adjacent tokens carry almost no new information, so the AR objective is trivially satisfied.
Sources: FAST §2, §4 (arXiv 2501.09747).
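You can feel the pathology outside a training run. A minimal sketch of per-cell uniform binning on a synthetic smooth chunk — the trajectory, offsets, and copy-rate printout are illustrative choices, not the papers' exact setups:

```python
import numpy as np

# A minimal sketch of per-cell uniform binning (the RT-2/OpenVLA recipe)
# on a synthetic smooth chunk. The trajectory and scales are made up;
# the point is the 50 Hz regime, where per-step deltas are tiny.
T, D, BINS = 50, 7, 256
rng = np.random.default_rng(0)

# Smooth joint motion in a normalized [-1, 1] action space:
# slow sinusoids around per-joint offsets, plus a little sensor noise.
t = np.linspace(0, 1, T)[:, None]                        # (T, 1), one second
freq = 0.2 + 0.4 * rng.random(D)                         # 0.2-0.6 Hz
chunk = (rng.uniform(-0.5, 0.5, D)
         + 0.1 * np.sin(2 * np.pi * freq * t + 2 * np.pi * rng.random(D))
         + 0.001 * rng.standard_normal((T, D)))          # (T, D)

# Naive tokenization: each (t, d) scalar independently -> one of 256 bin IDs.
tokens = np.clip(((chunk + 1) / 2 * BINS).astype(int), 0, BINS - 1)

# The pathology: the next token in time is very often a literal copy,
# so "repeat the previous token" already achieves low prediction loss.
copy_rate = (tokens[1:] == tokens[:-1]).mean()
print(f"{T * D} tokens; adjacent-in-time copy rate: {copy_rate:.0%}")
```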
Viz 3 — Why time, why DCT
Explain the two structural choices before showing the pipeline: transform axis (time, not dim) and transform family (DCT, not DFT or wavelet).
Temporal (50×50) vs cross-dim (7×7) Pearson correlation on the same 50×7 chunk.
Threshold filter fades cells with |corr| ≤ 0.5 to reveal structure.
[Viz 3 panel — Temporal correlation (50 × 50): pearson(chunk[i], chunk[j]) across 7 dims; x: timestep j, y: timestep i. Live readouts: mean |corr| off-diagonal and adjacent (i, i+1).]
The thick diagonal band is the signal: poses at nearby timesteps are nearly identical. That is the temporal smoothness DCT-along-time compacts into a few low-frequency coefficients.
[Viz 3 panel — Cross-dim correlation (7 × 7): pearson(col_i, col_j) across 50 timesteps; x: dim j, y: dim i. Live readouts: mean |corr| off-diagonal and adjacent (j, j+1).]
Off-diagonal pattern is scrambled — joint 3 is not geometrically "near" joint 4. No smooth structure along the dim axis to exploit.
[Readout: adjacent-step correlation ratio. Time axis has …× more adjacent correlation than the dim axis.] DCT exploits this; that's why it runs along time, not dim.
(Global mean |corr| off-diagonal is similar for both matrices because they include many unrelated pairs. The structural difference lives next to the diagonal, where smoothness matters.)
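A minimal sketch of the comparison the viz runs, on a synthetic smooth chunk (the real viz uses a recorded one). It reproduces the caveat above: the global off-diagonal means come out close, while the adjacent-diagonal means diverge.

```python
import numpy as np

# Synthetic smooth chunk: slow sinusoids with random phases plus noise.
T, D = 50, 7
rng = np.random.default_rng(0)
t = np.linspace(0, 1, T)[:, None]
chunk = np.sin(2 * np.pi * (0.5 + rng.random(D)) * t + 2 * np.pi * rng.random(D))
chunk += 0.01 * rng.standard_normal((T, D))

temporal = np.corrcoef(chunk)     # (50, 50): poses correlated across the 7 dims
cross_dim = np.corrcoef(chunk.T)  # (7, 7): joints correlated across the 50 steps

def global_offdiag(C):
    mask = ~np.eye(len(C), dtype=bool)
    return np.abs(C[mask]).mean()

def adjacent(C):
    return np.abs(np.diagonal(C, offset=1)).mean()

# Global means are diluted by the many unrelated pairs;
# the structural difference lives next to the diagonal.
for name, C in (("temporal ", temporal), ("cross-dim", cross_dim)):
    print(f"{name}  global |corr|: {global_offdiag(C):.2f}"
          f"   adjacent |corr|: {adjacent(C):.2f}")
```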
Time axis specifically — that's where the redundancy lives. Actions are smooth in time (temporal correlation), not in dimension index (categorical — joint 3 isn't "near" joint 4 geometrically). You don't DCT across dims.
Two decisions precede the math. First, transform along the time axis, not across dimensions. Actions are smooth in time — adjacent timesteps of joint 3 are nearly equal — but they are not smooth across the dimension index. Joint 3 is not geometrically "near" joint 4; the ordering is categorical. Transforming across dims would destroy structure, not concentrate it. Second, use the Discrete Cosine Transform rather than the DFT or a wavelet. DCT implicitly mirror-extends the signal before transforming, which kills the boundary discontinuities the DFT's periodic extension would inject. For smooth trajectories with arbitrary endpoints, mirroring concentrates energy into a handful of low-frequency coefficients. The paper is explicit that the JPEG analogy is the design, not a metaphor:
"DCT is commonly used for compression algorithms (e.g., JPEG) due to its simplicity, computational efficiency, and strong compression property." (FAST §3)
One honest caveat: the authors did not ablate DCT against alternatives. The validated claim is "discretization plus frequency-domain compaction works"; DCT specifically is the authors' default, not a proven optimum.