The FAST tokenizer is a three-stage deterministic preprocessor that runs once offline and then behaves as a black box. Stage 1 applies a 1D DCT along the time axis of each action dimension, producing a coefficient matrix whose energy is concentrated in a few low-frequency entries. Stage 2 applies uniform scalar quantization per coefficient, collapsing most high-frequency entries to zero and leaving a sparse integer matrix. Stage 3 runs BPE on the flattened integer stream, merging recurring patterns into a fixed vocabulary of 1024 to 2048 tokens that drops straight into a pretrained VLM's embedding table. Input: 350 floats. Output: 30 to 60 tokens per chunk. Training time drops roughly 5× on comparable tasks; inference is addressed on Page 3.

Sources: FAST §3 (arXiv 2501.09747); Pertsch X thread, Jan 2025.

Viz 4 — DCT energy compaction

Show quantitatively why DCT is the enabling step. Top-16 of 350 coefficients captures 95.1% of signal energy.

50×7 chunk → 350 DCT coefficients. Variance collapses exponentially with frequency index; top-16 captures 95.1% of the signal energy. [FAST §3] [fast_deep_dive §1]

A · DCT spectrum — variance per coefficient index (log scale)

For k = 0..49, variance is taken across the 7 DOF dims of dct[k][d]. Bars are log-scaled.
x — coefficient index k (0 = DC) · y — log10 variance over 7 dims

B · Reconstruction from top-K coefficients

Keeps the K largest-magnitude coefficients of the 350 (50 time × 7 dim), zeros the rest, and applies the inverse DCT.
Keyboard: ← / → nudge K by 1 · ↑ / ↓ by 5. Click a preset to jump.
original chunk — 7 dims over 50 timesteps
reconstruction — top-K coefficients, inverse DCT

C · JPEG analogy

FAST’s top-K energy compaction on action chunks is structurally the same trick JPEG plays on 8×8 image patches. Both rely on the DCT’s property that smooth signals — adjacent pixels or adjacent robot poses — concentrate their energy in a few low-frequency coefficients. Drop the high-frequency tail, quantize the survivors, entropy-code the zeros, and reconstruction is still near-perfect. At K = 16 (the 95%-energy cutoff) the reconstructed trajectory is visually indistinguishable from the original; this is exactly what JPEG does inside every 8×8 block.
DCT is commonly used for compression algorithms (e.g., JPEG) due to its simplicity, computational efficiency, and strong compression property: since pixels often vary smoothly, DCT can often represent most of the information of an input signal in only a few coefficients.
— [FAST §3]
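The image side of the analogy can be checked in a few lines. A minimal sketch, assuming scipy's `dctn`/`idctn` and a hand-made smooth 8×8 patch; JPEG's actual quantization tables and zig-zag scan are not reproduced here:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Smooth 8x8 "image patch" (a gentle gradient, like neighboring pixels
# in a photograph) -- a stand-in, not real image data
y, x = np.mgrid[0:8, 0:8]
patch = 0.5 * x + 0.3 * y + 0.1 * np.sin(x / 3)

coeffs = dctn(patch, norm="ortho")        # 2D DCT, as JPEG applies per block

# Keep only the 6 largest-magnitude coefficients of 64, zero the rest
flat = np.abs(coeffs).ravel()
mask = np.zeros(64, dtype=bool)
mask[np.argsort(flat)[::-1][:6]] = True
sparse = np.where(mask.reshape(8, 8), coeffs, 0.0)

recon = idctn(sparse, norm="ortho")       # reconstruction from the sparse matrix
print("max abs error with 6/64 coeffs:", np.abs(patch - recon).max())
```

For a smooth patch the retained coefficients carry nearly all the energy, so the reconstruction error stays small; the same mechanism drives the action-chunk case above.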

Run the DCT once per dimension. The output is a T × D matrix of coefficients ordered from DC (index 0) to the highest representable frequency (index T − 1). Plot the variance of each coefficient across the training distribution and the curve falls off exponentially: the first few coefficients carry nearly all the signal energy; the tail is near zero. This is the same property JPEG exploits on image patches — and the paper invokes the analogy literally, not rhetorically:

"since pixels often vary smoothly, DCT can often represent most of the information of an input signal in only a few coefficients." (FAST §3)

For smooth motor trajectories the prior holds with little modification. The mechanical consequence is that information has now been sorted by importance. Any fixed-precision quantization applied next will land most bits where they matter and zero out the rest, without handcrafting a Q-table.
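The energy-compaction claim is easy to sanity-check. A minimal sketch, assuming numpy/scipy and a synthetic smooth trajectory as a stand-in for real joint-angle chunks (the 95.1% figure comes from FAST's real training data, so the numbers here will differ):

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
T, D = 50, 7                           # 50 timesteps x 7 action dims

# Synthetic smooth trajectory: a few low-frequency sinusoids per dim
t = np.linspace(0, 1, T)
chunk = np.stack([
    sum(rng.normal() * np.sin(np.pi * f * t) for f in range(1, 4))
    for _ in range(D)
], axis=1)                             # shape (T, D)

# Stage 1: orthonormal DCT-II along the time axis, independently per dim
coeffs = dct(chunk, axis=0, norm="ortho")

# Fraction of total energy held by the K largest-magnitude coefficients
K = 16
flat = np.abs(coeffs).ravel()
topk = np.sort(flat ** 2)[::-1][:K].sum() / (flat ** 2).sum()
print(f"top-{K} of {T * D} coeffs hold {topk:.1%} of the energy")

# Reconstruction from only those K coefficients (zero the rest, inverse DCT)
mask = np.zeros(T * D, dtype=bool)
mask[np.argsort(flat)[::-1][:K]] = True
sparse = np.where(mask.reshape(T, D), coeffs, 0.0)
recon = idct(sparse, axis=0, norm="ortho")
print(f"reconstruction MSE: {np.mean((chunk - recon) ** 2):.6f}")
```

The variance-per-coefficient falloff in Viz 4A is the same quantity: for smooth inputs, almost everything outside the low-k columns is noise-level.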

Sources: FAST §3 (arXiv 2501.09747).

Viz 5 — Quantization

Show the sparsity that makes BPE efficient. A uniform step on DCT coefficients is implicitly resolution-aware on the time-domain signal.

FAST step 2. A uniform scalar quantizer applied to DCT coefficients is implicitly non-uniform on the time-domain signal: DC + low harmonics retain real resolution, high harmonics fall below the step size and round to zero. [FAST §3], [fast_deep_dive §1]

Before — DCT coefficients (continuous floats)

Shape 50 × 7 (coefficient index k by action dim d). Diverging blue-white-red, centered at 0. Most magnitude is concentrated on the left side (low-k); high-k columns are near zero.
x = coefficient index k (0 = DC → 49 = highest freq)  |  y = action dim (0..6)
Step 2 — uniform scalar quantization (global step = max_abs ÷ 20, int8 range)
Most high-frequency coefficients (right side, all rows) round to zero because their magnitudes fell below the quantization step size. The dense float matrix becomes a sparse integer matrix — exactly the structure BPE compresses downstream.

After — quantized integers (zeros in black)

Same shape, same colormap. Cells with integer value 0 are painted black so sparsity is immediately legible.
x = coefficient index k  |  y = action dim. Right side should be mostly black.
Legend: exact zeros in black · readout shows the % of entries quantized to zero · step-scale control, default 1.0× (larger = coarser step = more zeros, worse reconstruction)
“The DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension… the BPE step ‘squashes’ the zero-valued components and merges frequently-occurring coefficient combinations across action dimensions.” — Pertsch et al., FAST: Efficient Action Tokenization for VLAs, §3

After DCT, apply uniform scalar quantization per coefficient: a fixed scale, round to integer bins. Because the DCT has already sorted energy by frequency, a uniform step on coefficients becomes an implicitly resolution-aware operation on the original signal. Low-frequency coefficients retain real precision; high-frequency coefficients whose magnitude sits below the step size round to exactly zero. After quantization the matrix is dominated by zeros, with a small minority of nonzero low-frequency entries and a handful of mid-range ones. The paper puts the sparsity plainly:

"The DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension." (FAST §3)

This sparsity is the precondition for the next stage. BPE's compression ratio on dense random integers would be mediocre; on a mostly-zero stream with repeating low-frequency patterns, it is substantial.
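Stage 2 can be sketched in a few lines. This uses a synthetic smooth chunk and the global step of max_abs ÷ 20 from this page's viz caption (an illustrative choice, not necessarily the paper's exact scale rule):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
T, D = 50, 7
t = np.linspace(0, 1, T)
# Synthetic smooth chunk: a few low-frequency sinusoids per action dim
chunk = np.stack([
    sum(rng.normal() * np.sin(np.pi * f * t) for f in range(1, 4))
    for _ in range(D)
], axis=1)
coeffs = dct(chunk, axis=0, norm="ortho")

# Stage 2: one global step for every coefficient; max_abs / 20 keeps the
# quantized values comfortably inside int8 range
step = np.abs(coeffs).max() / 20
q = np.round(coeffs / step).astype(np.int8)

# Uniform in coefficient space, implicitly resolution-aware in time:
# low-frequency coefficients span many integer levels, while the
# high-frequency tail falls below one step and rounds to exactly zero
sparsity = (q == 0).mean()
print(f"{sparsity:.1%} of the {T * D} quantized entries are zero")
print("nonzero count per frequency k (first 8 rows):", (q != 0).sum(axis=1)[:8])
```

The resulting mostly-zero integer matrix is exactly the input structure the BPE stage exploits.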

Sources: FAST §3 (arXiv 2501.09747).

Viz 6 — BPE merge

Why BPE specifically, not Huffman or arithmetic coding. The engineering win is fixed-vocab compatibility with the VLM embedding table.

Byte-pair encoding greedily merges the most-frequent adjacent token pair into a new vocab ID, then repeats. Applied to FAST's flattened quantized DCT, it drops the chunk from 350 → 66 tokens. [FAST §3] [fast_deep_dive §1]

“We choose BPE to compress the DCT matrix, since many efficient implementations exist and it can produce a fixed-size output vocabulary that can be easily integrated into the existing vocabulary of vision-language models.” — FAST §3
The last clause is the real engineering win. BPE wasn't chosen for compression efficiency — it was chosen for fixed-vocab compatibility with the VLM embedding table.
Step 0 — initial sequence (350 tokens). Hit Play or press Space to start merging.
keyboard: ← / → step · space = play/pause
Token count vs merge step
Legend: base token (ID 0-255) · merge token (ID ≥ 256) · pair just merged

Vocabulary

Base tokens (ID 0-255)
Merge tokens (ID ≥ 256) — 10 most recent
ID | → (a, b) | freq
(no merges yet)

Flatten the quantized integer matrix into a 1D sequence. Run BPE — byte-pair encoding trained once on a large action corpus. BPE is algorithmically agnostic to symbol meaning; it sees integers and counts adjacent-pair frequencies, greedily merging the most common pairs into new symbols. Runs of zeros collapse to single tokens; recurring low-frequency patterns that appear across the training distribution become motor-primitive tokens. The paper is direct about why BPE and not a better-compressing code:

"We choose BPE to compress the DCT matrix, since many efficient implementations exist and it can produce a fixed-size output vocabulary that can be easily integrated into the existing vocabulary of vision-language models." (FAST §3)

That last clause is the actual engineering win. Huffman would compress comparably, but BPE emits a fixed-size vocabulary (FAST uses 1024–2048 tokens) whose entries drop cleanly into a pretrained VLM's embedding table. The tokenizer is designed as an interface layer, not just a codec.
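The merge loop itself is only a few lines. A toy sketch of greedy BPE on an integer stream; real FAST trains the merges once on a large action corpus and then applies them in fixed order at encode time, and `bpe_train` here is a hypothetical helper, not the paper's implementation:

```python
from collections import Counter

def bpe_train(seq, num_merges):
    """Greedy BPE: repeatedly replace the most frequent adjacent pair
    with a fresh token ID. `seq` is a list of small non-negative ints."""
    merges = {}                          # (a, b) -> new token ID
    next_id = max(seq) + 1
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                        # nothing left worth merging
        (a, b), _ = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        out, i = [], 0
        while i < len(seq):              # left-to-right, non-overlapping
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return seq, merges

# A mostly-zero stream, shaped like a flattened quantized DCT matrix
stream = [5, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
          5, 3, 0, 0, 0, 0, 1, 0, 0, 0]
tokens, merges = bpe_train(stream, num_merges=8)
print(len(stream), "->", len(tokens), "tokens after", len(merges), "merges")
```

Note how the first merges go to the zero-zero pair: runs of zeros collapse fastest, which is why quantization-induced sparsity matters so much for the compression ratio.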

Sources: FAST §3 (arXiv 2501.09747); FAST §VI-D ablation notes.

Viz 7 — End-to-end pipeline

Pin down the quantitative claim and set up Page 3's honesty: 10× compression, ~5× training speedup, ~2× slower inference than flow matching.

One chunk (50×7 floats) → DCT → quantize → flatten → BPE → a short token sequence. [FAST §3] [fast_deep_dive_2026-04-14.md §1]

Compression: 350 floats → 66 tokens (5.3×)
Trains ~5× faster than per-cell tokenization — at the cost of ~2× slower inference vs flow matching. [FAST §VI-D]

1 Input chunk

t = 0 → 49 (rows), joints 0-6 (cols, colored palette)
350 floats — 50 timesteps × 7 dims
raw joint-angle trajectory (radians)
shape 50×7 · floats
DCT along
time axis

2 DCT matrix

diverging colormap centered at 0
k = 0 → 49 (freq index, rows), dim 0-6 (cols)
350 floats — same shape, frequency domain
energy compacted into low-k columns (left side) — top-16 coeffs hold 95% of energy
shape 50×7 · floats
uniform scalar
quant (per-coef)

3 Quantized integers

black cells = rounded-to-zero (sparse)
350 ints — but 84.6% are zero
integer range ≈ [-20, +20] (int8-safe)
~54 non-zero · int8
flatten (row-major)
+ BPE merge

4 BPE tokens

1 cell = 1 token ID (hash-colored)
66 tokens — fixed vocab, pastes into VLM embedding table
14 BPE merges applied
1×66 · uint
“The DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension… the BPE step ‘squashes’ the zero-valued components and merges frequently-occurring coefficient combinations across action dimensions.” — FAST §3
The pipeline collapses 350 correlated floats into ~66 discrete tokens that plug straight into a VLM's embedding table — that's the whole trick. DCT makes the signal sparse in frequency, quantization hardens the sparsity, BPE eats the zero-runs and common patterns. [FAST §3] [FAST §VI-D] [fast_deep_dive_2026-04-14.md §1]

End to end: a chunk of 50 timesteps by 7 dimensions enters as 350 floats. DCT reshapes it into an energy-compacted 50×7 coefficient matrix. Quantization drops most of it to zero. The sparse integer matrix is flattened, and BPE merges it into 30 to 60 tokens drawn from a fixed 1024–2048 vocabulary (FAST paper numbers; FAST blog for chunk-level token count). The headline gains, as reported by PI: roughly 10× compression versus naive per-dim, per-timestep binning, and roughly 5× training speedup on comparable tasks (Pertsch X thread, Jan 2025). Two framing notes, kept brief here and expanded on Page 3: the 5× is a training speedup, not inference — inference is in fact slower, at roughly 750 ms per chunk per the FAST paper itself — and the 10× compression comes with codebook utilization around 48%, per FASTer benchmarks. The pipeline is coherent; its deployment profile is where the honest caveats live.
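The whole pipeline fits in one function. A toy end-to-end sketch: synthetic smooth chunk, the illustrative max_abs ÷ 20 quantization step, and BPE merges trained on the single chunk instead of a large corpus — all assumptions for demonstration, not the paper's exact settings:

```python
import numpy as np
from scipy.fft import dct
from collections import Counter

def tokenize_chunk(chunk, num_merges=32):
    """Toy FAST-style pipeline: (T, D) float chunk -> short token list."""
    coeffs = dct(chunk, axis=0, norm="ortho")      # stage 1: DCT per dim
    step = np.abs(coeffs).max() / 20
    q = np.round(coeffs / step).astype(int)        # stage 2: sparse ints
    stream = (q - q.min()).ravel().tolist()        # shift non-negative, flatten row-major
    seq, next_id = stream, max(stream) + 1
    for _ in range(num_merges):                    # stage 3: greedy BPE merges
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break
        (a, b), _ = pairs.most_common(1)[0]
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(next_id); i += 2
            else:
                out.append(seq[i]); i += 1
        seq, next_id = out, next_id + 1
    return seq

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
chunk = np.stack([sum(rng.normal() * np.sin(np.pi * f * t) for f in range(1, 4))
                  for _ in range(7)], axis=1)      # 50x7 = 350 floats
tokens = tokenize_chunk(chunk)
print(f"{chunk.size} floats -> {len(tokens)} tokens")
```

The compression ratio on this synthetic chunk won't match the paper's figures, but the shape of the transformation — dense floats in, a much shorter discrete sequence out — is the point.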

Sources: FAST §3, §VI-D (arXiv 2501.09747); Pertsch X thread, Jan 2025; FASTer (arXiv 2512.04952) for 48% codebook utilization.