The FAST tokenizer is a three-stage deterministic preprocessor that runs once offline and then behaves as a black box. Stage 1 applies a 1D DCT along the time axis of each action dimension, producing a coefficient matrix whose energy is concentrated in a few low-frequency entries. Stage 2 applies uniform scalar quantization per coefficient, collapsing most high-frequency entries to zero. Stage 3 runs BPE on the flattened integer stream, merging recurring patterns into a fixed vocabulary of 1024 to 2048 tokens that drops straight into a pretrained VLM's embedding table. Input: 350 floats. Output: 30 to 60 tokens per chunk. Training time drops roughly 5× on comparable tasks; inference is addressed on Page 3.
Sources: FAST §3 (arXiv 2501.09747); Pertsch X thread, Jan 2025.
Viz 4 — DCT energy compaction
Show quantitatively why DCT is the enabling step. Top-16 of 350 coefficients captures 95.1% of signal energy.
50×7 chunk → 350 DCT coefficients. Variance collapses exponentially with frequency index;
top-16 captures 95.1% of the signal energy.
[FAST §3] [fast_deep_dive §1]
A · DCT spectrum — variance per coefficient index (log scale)
For k = 0..49, variance is taken across the 7 DOF dims of dct[k][d]. Bars are log-scaled.
x — coefficient index k (0 = DC) · y — log10 variance over 7 dims
B · Reconstruction from top-K coefficients
Keeps the K largest-magnitude coefficients out of all 350 (50 time × 7 dim), zeros the rest, inverse-DCT.
K = 16 / 350 · MSE = 0.0000
Keyboard: ← / → nudge K by 1 · ↑ / ↓ by 5. Click a preset to jump.
original chunk — 7 dims over 50 timesteps
reconstruction — top-K coefficients, inverse DCT
C · JPEG analogy
FAST’s top-K energy compaction on action chunks is structurally the same trick JPEG plays on 8×8 image patches.
Both rely on the DCT’s property that smooth signals — adjacent pixels or adjacent robot poses — concentrate their energy in a few low-frequency coefficients.
Drop the high-frequency tail, quantize the survivors, entropy-code the zeros, and reconstruction is still near-perfect.
At K=16 (95%-energy cutoff) the robot trajectory looks visually identical; this is exactly what JPEG does inside every 8×8 block.
DCT is commonly used for compression algorithms (e.g., JPEG) due to its simplicity, computational efficiency, and strong compression property: since pixels often vary smoothly, DCT can often represent most of the information of an input signal in only a few coefficients.
— [FAST §3]
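Panel B's mechanics fit in a few lines of SciPy: keep the K largest-magnitude coefficients, zero the rest, inverse-DCT. A minimal sketch, with a synthetic smooth chunk standing in for a real trajectory and K = 16 matching the panel's default:

```python
# Sketch of Panel B: keep the K largest-magnitude DCT coefficients of a
# 50x7 chunk, zero the rest, inverse-DCT, and measure reconstruction error.
# The chunk is synthetic (smooth sinusoids), standing in for a real trajectory.
import numpy as np
from scipy.fft import dct, idct

T, D, K = 50, 7, 16
t = np.linspace(0, 1, T)[:, None]
chunk = 0.3 * np.sin(2 * np.pi * np.arange(1, D + 1) * t / 4)   # (T, D), smooth per dim

coeffs = dct(chunk, axis=0, norm="ortho")          # 1D DCT along time, per dimension

# Keep the K largest-magnitude entries out of all T*D = 350, zero the rest.
threshold = np.sort(np.abs(coeffs).ravel())[-K]
kept = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
recon = idct(kept, axis=0, norm="ortho")

mse = np.mean((chunk - recon) ** 2)
print(f"kept {int((kept != 0).sum())} / {T * D} coefficients, MSE = {mse:.6f}")
```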
Run the DCT once per dimension. The output is a T × D matrix of coefficients ordered from DC (index 0) to the highest representable frequency (index T−1). Plot the variance of each coefficient across the training distribution and the curve falls off exponentially: the first few coefficients carry nearly all the signal energy, the tail is near zero. This is the same property JPEG exploits on image patches — and the paper invokes the analogy literally, not rhetorically:
"since pixels often vary smoothly, DCT can often represent most of the information of an input signal in only a few coefficients." (FAST §3)
For smooth motor trajectories the prior holds with little modification. The mechanical consequence is that information has now been sorted by importance. Any fixed-precision quantization applied next will land most bits where they matter and zero out the rest, without handcrafting a Q-table.
Sources: FAST §3 (arXiv 2501.09747).
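A minimal sketch of that variance spectrum, assuming a batch of synthetic low-frequency chunks stands in for the training distribution (the 95.1% figure in the panel comes from real data):

```python
# Sketch of Panel A: DCT each dimension, then look at variance per coefficient
# index across a batch of chunks. Synthetic low-frequency chunks stand in for
# the training distribution.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
N, T, D = 256, 50, 7                                  # chunks, timesteps, action dims
t = np.linspace(0, 1, T)

chunks = np.zeros((N, T, D))
for f in (1, 2, 3):                                   # a few low-frequency harmonics
    amp = rng.normal(size=(N, 1, D)) / f
    phase = rng.uniform(0, 2 * np.pi, size=(N, 1, D))
    chunks += amp * np.sin(2 * np.pi * f * t[None, :, None] + phase)

coeffs = dct(chunks, axis=1, norm="ortho")            # (N, T, D)

var_per_k = coeffs.var(axis=(0, 2))                   # variance per coefficient index k
energy_per_entry = (coeffs ** 2).mean(axis=0)         # (T, D): mean energy per entry
top16 = np.sort(energy_per_entry.ravel())[::-1][:16]
print("variance, first 8 indices:", np.round(var_per_k[:8], 4))
print(f"top-16 of 350 entries hold {top16.sum() / energy_per_entry.sum():.1%} of the energy")
```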
Viz 5 — Quantization
Show the sparsity that makes BPE efficient. A uniform step on DCT coefficients is implicitly resolution-aware on the time-domain signal.
FAST step 2. A uniform scalar quantizer applied to DCT coefficients is
implicitly non-uniform on the time-domain signal: DC + low harmonics
retain real resolution, high harmonics fall below the step size and round to zero.
[FAST §3], [fast_deep_dive §1]
Before — DCT coefficients (continuous floats)
Shape 50 × 7 (coefficient index k by action dim d). Diverging blue-white-red, centered at 0.
Most magnitude is concentrated on the left side (low-k); high-k columns are near zero.
x = coefficient index k (0 = DC → 49 = highest freq) | y = action dim (0..6)
Most high-frequency coefficients (right side, all rows) round to zero because their
magnitudes fell below the quantization step size. The dense float matrix becomes a
sparse integer matrix — exactly the structure BPE compresses downstream.
After — quantized integers (zeros in black)
Same shape, same colormap. Cells with integer value 0 are painted black so sparsity is immediately legible.
x = coefficient index k | y = action dim. Right side should be mostly black.
color scale −max → +max · black = exact zero
readout: 0.00% zero
quantization scale: 1.0× (larger = coarser step = more zeros, worse reconstruction)
“The DCT coefficient matrix is typically sparse, with most entries being zero and
only a few significant coefficients remaining per action dimension… the BPE step
‘squashes’ the zero-valued components and merges frequently-occurring
coefficient combinations across action dimensions.”
— Pertsch et al., FAST: Efficient Action Tokenization for VLAs, §3
After DCT, apply uniform scalar quantization per coefficient: a fixed scale, round to integer bins. Because the DCT has already sorted energy by frequency, a uniform step on coefficients becomes an implicitly resolution-aware operation on the original signal. Low-frequency coefficients retain real precision; high-frequency coefficients whose magnitude sits below the step size round to exactly zero. After quantization the matrix is dominated by zeros, with a small minority of nonzero low-frequency entries and a handful of mid-range ones. The paper puts the sparsity plainly:
"The DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension." (FAST §3)
This sparsity is the precondition for the next stage. BPE's compression ratio on dense random integers would be mediocre; on a mostly-zero stream with repeating low-frequency patterns, it is substantial.
Sources: FAST §3 (arXiv 2501.09747).
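A minimal sketch of the quantization step; the scale value is illustrative, not the paper's tuned setting:

```python
# Sketch of step 2: one fixed scale, round to integers. Entries whose magnitude
# sits below half the step size become exactly zero.
import numpy as np
from scipy.fft import dct

T, D = 50, 7
t = np.linspace(0, 1, T)[:, None]
chunk = 0.3 * np.sin(2 * np.pi * np.arange(1, D + 1) * t / 4)    # smooth synthetic chunk

coeffs = dct(chunk, axis=0, norm="ortho")
scale = 10.0                                   # bigger scale = finer step = fewer zeros
q = np.round(coeffs * scale).astype(np.int32)  # uniform step of 1/scale on every entry

print(f"{int((q != 0).sum())} non-zero of {q.size} entries ({(q == 0).mean():.1%} zero)")
print("integer range:", int(q.min()), "to", int(q.max()))
```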
Viz 6 — BPE merge
Why BPE specifically, not Huffman or arithmetic coding. The engineering win is fixed-vocab compatibility with the VLM embedding table.
Byte-pair encoding greedily merges the most-frequent adjacent token pair into a new vocab ID,
then repeats. Applied to FAST's flattened quantized DCT, it drops the chunk from 350 → 66 tokens.
[FAST §3] [fast_deep_dive §1]
“We choose BPE to compress the DCT matrix, since many efficient implementations exist and
it can produce a fixed-size output vocabulary that can be easily integrated into the existing
vocabulary of vision-language models.”
— FAST §3
The last clause is the real engineering win. BPE wasn't chosen for compression efficiency; it was chosen for fixed-vocab compatibility with the VLM embedding table.
Step 0 — initial sequence (350 tokens). Hit Play or press → to start merging.
step 0 / 0 · tokens 350 · unique IDs 0 · merges used 0
keyboard: ← / → step · space = play/pause
Token count vs merge step
legend: base token (ID 0-255) · merge token (ID ≥ 256) · pair just merged
Vocabulary
Base tokens (ID 0-255): —
Merge tokens (ID ≥ 256) — 10 most recent · 0 merges
columns: ID · → (a, b) · freq
(no merges yet)
Flatten the quantized integer matrix into a 1D sequence. Run BPE — byte-pair encoding trained once on a large action corpus. BPE is algorithmically agnostic to symbol meaning; it sees integers and counts adjacent-pair frequencies, greedily merging the most common pairs into new symbols. Runs of zeros collapse to single tokens; recurring low-frequency patterns that appear across the training distribution become motor-primitive tokens. The paper is direct about why BPE and not a better-compressing code:
"We choose BPE to compress the DCT matrix, since many efficient implementations exist and it can produce a fixed-size output vocabulary that can be easily integrated into the existing vocabulary of vision-language models." (FAST §3)
That last clause is the actual engineering win. Huffman would compress comparably, but BPE emits a fixed-size vocabulary (FAST uses 1024–2048 tokens) whose entries drop cleanly into a pretrained VLM's embedding table. The tokenizer is designed as an interface layer, not just a codec.
Sources: FAST §3 (arXiv 2501.09747); FAST §VI-D ablation notes.
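The core merge loop is small enough to sketch. This toy is not the paper's trained tokenizer (FAST fits a standard BPE vocabulary on a large action corpus); the input here is a hand-built mostly-zero stream shaped like a flattened quantized DCT matrix:

```python
# Toy greedy BPE over a flattened integer stream: repeatedly find the most
# frequent adjacent pair and replace it with a new vocabulary ID.
from collections import Counter

def bpe_compress(seq, num_merges, first_new_id=256):
    merges = {}
    for new_id in range(first_new_id, first_new_id + num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break                                  # nothing left worth merging
        merges[new_id] = (a, b)
        out, i = [], 0
        while i < len(seq):                        # replace every (a, b) with new_id
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

# A mostly-zero stream with repeating low-frequency patterns (350 entries).
stream = ([12, 3, 0, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 0] * 25)[:350]
tokens, merges = bpe_compress(stream, num_merges=14)
print(f"{len(stream)} ints -> {len(tokens)} tokens after {len(merges)} merges")
```

Zero-runs collapse first, since (0, 0) is by far the most frequent pair; that is why the quantizer's sparsity directly sets BPE's compression ratio.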
Viz 7 — End-to-end pipeline
Pin down the quantitative claim and set up Page 3's honesty: 10× compression, ~5× training speedup, ~2× slower inference than flow matching.
One chunk (50×7 floats) → DCT → quantize → flatten → BPE → a short token sequence.
[FAST §3] [fast_deep_dive_2026-04-14.md §1]
Compression: 350 floats → 66 tokens (5.3×)
Trains ~5× faster than per-cell tokenization — at the cost of ~2× slower inference vs flow matching.
[FAST §VI-D]
2 · DCT coefficients
energy compacted into low-k columns (left side) — top-16 coeffs hold 95% of energy
shape 50×7 · floats
→ uniform scalar quant (per-coef)
3 · Quantized integers
black cells = rounded-to-zero (sparse) · colored cells = non-zero
350 ints — but 84.6% are zero
integer range ≈ [-20, +20] (int8-safe)
~54 non-zero · int8
→ flatten (row-major) + BPE merge
4 · BPE tokens
1 cell = 1 token ID (hash-colored)
66 tokens — fixed vocab, pastes into VLM embedding table
14 BPE merges applied
1×66 · uint
“The DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension… the BPE step ‘squashes’ the zero-valued components and merges frequently-occurring coefficient combinations across action dimensions.”
— FAST §3
The pipeline collapses 350 correlated floats into ~66 discrete tokens that plug straight into a VLM's embedding table — that's the whole trick. DCT makes the signal sparse in frequency, quantization hardens the sparsity, BPE eats the zero-runs and common patterns.
[FAST §3] [FAST §VI-D] [fast_deep_dive_2026-04-14.md §1]
End to end: a chunk of 50 timesteps by 7 dimensions enters as 350 floats. The DCT reshapes it into an energy-compacted 50×7 coefficient matrix. Quantization drops most of it to zero. The result is flattened and BPE merges it into 30 to 60 tokens drawn from a fixed 1024–2048-token vocabulary (vocabulary size from the FAST paper; chunk-level token count from the FAST blog). The headline gains, as reported by PI: roughly 10× compression versus naive per-dim, per-timestep binning, and roughly 5× training speedup on comparable tasks (Pertsch X thread, Jan 2025). Two framing notes, kept brief here and expanded on Page 3: the 5× is a training speedup, not an inference speedup (inference is in fact slower, at roughly 750 ms per chunk, per the FAST paper itself), and the 10× compression comes with codebook utilization around 48%, per FASTer benchmarks. The pipeline is coherent; its deployment profile is where the honest caveats live.
Sources: FAST §3, §VI-D (arXiv 2501.09747); Pertsch X thread, Jan 2025; FASTer (arXiv 2512.04952) for 48% codebook utilization.
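For completeness, a compact end-to-end sketch chaining the three steps on one synthetic chunk; the scale, merge count, and printed token count are illustrative, not the paper's numbers:

```python
# End-to-end on one synthetic chunk: DCT -> uniform quantization -> row-major
# flatten -> a few greedy pair merges.
import numpy as np
from scipy.fft import dct
from collections import Counter

T, D = 50, 7
t = np.linspace(0, 1, T)[:, None]
chunk = 0.3 * np.sin(2 * np.pi * np.arange(1, D + 1) * t / 4)       # 350 floats in

q = np.round(dct(chunk, axis=0, norm="ortho") * 10.0).astype(int)   # sparse integers
stream = [int(v) for v in q.ravel()]                                # row-major flatten

for step in range(14):                                              # greedy BPE merges
    pairs = Counter(zip(stream, stream[1:]))
    (a, b), freq = pairs.most_common(1)[0]
    if freq < 2:
        break
    new_id, out, i = 1000 + step, [], 0                             # IDs above int range
    while i < len(stream):
        if i + 1 < len(stream) and (stream[i], stream[i + 1]) == (a, b):
            out.append(new_id)
            i += 2
        else:
            out.append(stream[i])
            i += 1
    stream = out

print(f"350 floats -> {q.size} ints ({(q == 0).mean():.0%} zero) -> {len(stream)} tokens")
```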