Training a policy
7 min read · for SO-101 users
When you run lerobot-train, LeRobot loads your dataset, builds a
fresh policy network (or resumes one), and runs a standard PyTorch training
loop. The 0.5.0 release added one knob (cudnn_deterministic) and
silently changed image normalization for π0/π0.5 — your ACT and
Diffusion code is unchanged.
If you train ACT or Diffusion on your SO-101 demos, the upgrade changes nothing you'll notice. If you're fine-tuning π0 or π0.5, you may need to re-validate your weights, because the image preprocessing changed.
Training is teaching by example. The policy watches thousands of recorded demos and learns to predict "what motor command would the human have sent here?"
LeRobotDataset → DataLoader → Policy.forward(obs) → Loss → backprop → Optimizer.step()
The basics
What does lerobot-train actually do?
Five steps, in order: load the config from your CLI flags → load the
dataset (parquet + mp4 on disk, or pulled from the Hub) → build a fresh
policy network from scratch (or resume one from a checkpoint) → loop over
mini-batches calling policy.forward(batch), backprop, and
optimizer step → eval and checkpoint every N steps.
It's a vanilla PyTorch training loop wrapped by HuggingFace's
Accelerator. There is nothing exotic happening — if you've
trained any HF model before, this is the same shape.
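In outline the loop has this shape (a schematic, not LeRobot's actual code; the real loop, update_policy, is walked through in the deep-dive below):

```python
def train_loop(policy, dataloader, optimizer, lr_scheduler, accelerator, eval_freq=1000):
    """Schematic shape of lerobot-train's inner loop (heavily simplified)."""
    for step, batch in enumerate(dataloader):
        loss = policy.forward(batch)   # each policy computes its own loss
        accelerator.backward(loss)     # Accelerate handles AMP / device placement
        optimizer.step()
        optimizer.zero_grad()
        lr_scheduler.step()            # stepped once per batch
        if step % eval_freq == 0:
            ...                        # eval + checkpoint every N steps
```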
Picking a policy: ACT vs Diffusion vs SmolVLA vs π0/π0.5
You pass --policy.type=<name> at the CLI. The plain-language guide:
- ACT — small, fast transformer that predicts a chunk of future actions. Start here for SO-101 — 99% of small-data home setups want this.
- Diffusion Policy — slightly better when your demos show more than one valid way to do the task (multimodal behaviour). Same training time ballpark as ACT.
- SmolVLA — the smallest vision-language-action model. Language-conditioned ("pick the red cube"). Single-GPU friendly. Use when you want one policy that handles several tasks.
- π0 / π0.5 — large vision-language-action models from Physical Intelligence. More compute, best zero-shot generalization. Fine-tune from the Hub checkpoint, don't train from scratch.
- VQ-BeT, TD-MPC, SAC, GR00T, WALL-X, XVLA — niche or research-stage. SAC and TD-MPC are RL (skip unless you've read the RL page); GR00T wraps NVIDIA's external model.
What's a config?
Every CLI flag corresponds to a field on a Python dataclass. The configs nest:
a top-level TrainPipelineConfig contains a
DatasetConfig, a PreTrainedConfig (the policy), an
OptimizerConfig, an LRSchedulerConfig, and so on.
You override any nested field with dotted CLI flags
(--dataset.repo_id=..., --policy.type=act).
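For example, one run can override fields at several nesting depths (the repo id is a placeholder; the optimizer field name is an assumption for illustration, so check the dataclasses for the real field names):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --optimizer.lr=1e-4   # assumed field name, shown to illustrate dotted overrides
```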
Resume vs from-scratch
From-scratch is the default: pass --policy.type=act and a fresh
network is initialized. To resume an interrupted run, point at the output
folder with --config_path=outputs/your_run --resume=true —
LeRobot reloads the optimizer state, scheduler state, and step counter.
To fine-tune from a Hub checkpoint instead, use
--policy.pretrained_path=<hub-id> — weights load, but
training restarts at step 0.
How you actually use it
Train ACT on your dataset:
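A minimal run might look like this (repo id and output directory are placeholders):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --output_dir=outputs/act_so101
```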
Resume after a crash or pause:
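Point --config_path at the run's output folder, exactly as described above:

```bash
lerobot-train \
  --config_path=outputs/your_run \
  --resume=true
```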
Fine-tune SmolVLA from a Hub checkpoint:
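A sketch, assuming the base checkpoint lerobot/smolvla_base (substitute whichever Hub id you're starting from):

```bash
lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --dataset.repo_id=<your-hf-name>/so101-demos
```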
Things to know
The stock trainer is single-process with one Accelerator, and there is no
gradient-accumulation option, so effective batch size is hard-capped at
batch_size × num_processes. To reproduce the original Diffusion Policy paper
numbers you need an exponential moving average (EMA) of the weights; LeRobot
doesn't ship one, so add it externally if paper parity matters.
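If paper parity does matter, a minimal external EMA is easy to bolt on (a sketch, not a LeRobot API; it shadows parameters only, so also copy buffers if your policy depends on them):

```python
import copy
import torch

class EMA:
    """Minimal exponential moving average of model weights (external helper, not LeRobot)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # frozen shadow copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# After every optimizer.step(): ema.update(policy)
# Evaluate and checkpoint ema.shadow instead of the live policy.
```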
In v0.5.0, image-pad value flipped from -1.0 to
0.0 for π0 and π0.5, shifting normalization range from
[−1, 1] to [0, 1]. Re-validate any π0/π0.5 weights
you fine-tuned on v0.4.4 before trusting them on v0.5.0. ACT and
Diffusion are unaffected.
If you're fine-tuning a big VLA (π0, SmolVLA, WALL-X) on a single GPU,
pass --peft.method_type=LORA. There's no other memory-saving
path in the trainer — no FSDP, no DeepSpeed, no tensor parallelism.
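For example, a single-GPU π0 fine-tune with LoRA just combines flags already shown on this page (a sketch; <hub-id> stays a placeholder):

```bash
lerobot-train \
  --policy.type=pi0 \
  --policy.pretrained_path=<hub-id> \
  --peft.method_type=LORA \
  --dataset.repo_id=<your-hf-name>/so101-demos
```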
v0.5.0 ships examples/dataset/slurm_compute_rabc.py. It's
not a multi-node training launcher — it's an offline
annotator that runs the SARM reward model over your dataset. The trainer
itself is single-process. (More on SARM in the deep-dive below.)
Show the actual train loop with file:line refs
Entry point: scripts/lerobot_train.py (555
lines in v0.5.0, 551 in v0.4.4). main() calls
register_third_party_plugins() then
train(), wrapped by @parser.wrap() to feed it a
TrainPipelineConfig
(scripts/lerobot_train.py:549-551,
:153-154).
- Init / Accelerator. A single Accelerator with step_scheduler_with_optimizer=False and DistributedDataParallelKwargs(find_unused_parameters=True) (scripts/lerobot_train.py:177-187). torch.backends.cuda.matmul.allow_tf32 = True unconditionally; cudnn.deterministic is config-driven in 0.5.0.
- Dataset / env / policy. make_dataset(cfg) rank-0-first (:220-228); make_env(...) rank-0-only (:233-236); make_policy(cfg=cfg.policy, ds_meta=dataset.meta, rename_map=cfg.rename_map) (:240-244); optional policy.wrap_with_peft(...) (:246-250).
- Pre / post-processors. make_pre_post_processors(...) (:255-291). preprocessor_overrides["normalizer_processor"]["stats"] = dataset.meta.stats forces fresh dataset stats even when resuming (:266-284).
- Optim + scheduler. make_optimizer_and_scheduler (:295); RA-BC weights provider when cfg.use_rabc (:300-318).
- Sampler / dataloader. Policies exposing drop_n_last_frames (Diffusion etc.) get an EpisodeAwareSampler over per-episode index ranges; otherwise plain shuffle (:346-368). accelerator.prepare(...) wraps policy/optimizer/dataloader/scheduler.
- Train loop (update_policy, scripts/lerobot_train.py:59-150): optional RA-BC per-sample weighting (:96-117); accelerator.backward(loss); clip_grad_norm_ when grad_clip_norm > 0 (:122-130); optimizer.step() under an optional lock for SAC's separate optimizers (:133-134); lr_scheduler.step() per batch (:136-140); optional unwrapped policy.update() (TD-MPC's Polyak target update, :143-144).
- Eval / save. save_checkpoint(...) writes policy + optimizer + scheduler + processors; update_last_checkpoint() updates the symlink (:457-475); eval_policy_all over dict[suite][task_id] -> vec_env (:477-524); on finish, cfg.policy.push_to_hub pushes the unwrapped model + processors (:535-542).
Hooks present: resume; AMP via
accelerator.autocast(); per-sample weighting (RA-BC); gradient
clipping; lock-protected optimizer step (SAC).
Absent: explicit gradient accumulation; EMA weights; FSDP;
DeepSpeed; layer-wise depth-decay LR; checkpoint averaging or SWA.
cudnn_deterministic
One new field: cudnn_deterministic: bool = False
(configs/train.py:53-55). Wired in
scripts/lerobot_train.py:212-216: v0.4.4
unconditionally sets cudnn.benchmark = True; v0.5.0 picks
deterministic-or-benchmark based on the flag. Docstring says it "may
reduce training speed by ~10-20%".
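Schematically the wiring amounts to this (paraphrased from the behaviour described above, not a verbatim copy of scripts/lerobot_train.py:212-216):

```python
import torch

def configure_backends(cudnn_deterministic: bool) -> None:
    torch.backends.cuda.matmul.allow_tf32 = True   # unconditional in both versions
    if cudnn_deterministic:                        # new flag, default False
        torch.backends.cudnn.deterministic = True  # reproducible; ~10-20% slower per the docstring
        torch.backends.cudnn.benchmark = False
    else:
        torch.backends.cudnn.benchmark = True      # v0.4.4's unconditional behaviour
```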
Show all 14 native policies with architecture details
All policies subclass PreTrainedPolicy; each declares a
name string used by get_policy_class()
(policies/factory.py:60-138) and a
config_class. Each policy ships
configuration_<x>.py, modeling_<x>.py,
processor_<x>.py, and a README.
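That means resolving a policy class by its name string is a one-liner (a sketch; the import path assumes the src/lerobot layout mentioned later on this page):

```python
from lerobot.policies.factory import get_policy_class

policy_cls = get_policy_class("act")   # name string -> policy class
config_cls = policy_cls.config_class   # the paired configuration dataclass
```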
act — Action Chunking Transformer
DETR-style encoder/decoder over a CNN backbone, optional CVAE latent head,
optional online temporal ensembling. Defaults
chunk_size=100, n_action_steps=100,
n_obs_steps=1
(policies/act/configuration_act.py:85-87).
Two-group optimizer preset (backbone vs head)
(policies/act/modeling_act.py:71-90). Paper:
Zhao et al. arXiv:2304.13705.
diffusion — Diffusion Policy (DDPM/DDIM)
UNet1D over an action chunk conditioned on encoded image+state context, via
diffusers. EpisodeAwareSampler drops the last drop_n_last_frames
frames of each episode to avoid horizon bleed. Paper: Chi et al.,
IJRR 2024
(policies/diffusion/modeling_diffusion.py:17-21).
vqbet — Behavior Transformer over Residual-VQ-VAE
Two-stage: n_vqvae_training_steps=20000 of constant-LR RVQ
pre-training (16 codes, embedding 256), then transformer predicts code
indices via Focal loss + offset head
(policies/vqbet/configuration_vqbet.py:106-130).
Paper: Lee et al. arXiv:2403.03181.
tdmpc — Latent-dynamics MPC + TD learning
Online RL; update() is called every step for Polyak target
updates. Paper: Hansen, Wang, Su arXiv:2203.04955; FOWM additions
arXiv:2310.16029
(policies/tdmpc/modeling_tdmpc.py:511 for
update()).
sac — Soft Actor-Critic
Optional discrete gripper head, optionally shared image encoders. Returns
per-component param groups (actor, critic,
temperature, optional discrete_critic) for
MultiAdamConfig
(policies/sac/modeling_sac.py:39-72).
pi0 — π₀ port from OpenPI
PaliGemma VLM + Gemma action expert, conditional flow matching over a 50-step action chunk. Tokenizer length 48 (policies/pi0/configuration_pi0.py:32-97). Paper: arXiv:2410.24164.
pi0_fast — π₀ with FAST discrete action tokenization
DCT-based, decoded via scipy.fftpack.idct. Causal-LM head over
PaliGemma; loads HF tokenizer for action detokenization
(policies/pi0_fast/modeling_pi0_fast.py:32-44).
pi05 — π₀.₅ port
Differs from π₀: AdaRMS time conditioning via
time_mlp_*; tokenizer length 200; discrete state input (no
state_proj). Paper: arXiv:2504.16054
(policies/pi05/README.md:35-41).
smolvla — HF's affordable VLA
Pretrained VLM + flow-matching action expert. Paper: arXiv:2506.01844 (policies/smolvla/modeling_smolvla.py:17-50).
groot — wrapper over NVIDIA's Isaac-GR00T
GR00TN15.from_pretrained is the model loader; predict_action
via model.get_action
(policies/groot/modeling_groot.py:46-93). Paper:
arXiv:2503.14734 (GR00T N1).
sarm — Stage-Aware Reward Model (NOT a manipulation policy)
StageTransformer for stage class,
SubtaskTransformer for within-stage progress. Output feeds RA-BC
weighting. Paper: arXiv:2509.25358
(policies/sarm/modeling_sarm.py:18-25).
wall_x — WALL-OSS port
Qwen2.5-VL with MoE action experts; flow-matching or FAST-token action heads. PEFT/LoRA. Paper: arXiv:2509.11766 (policies/wall_x/modeling_wall_x.py:17-89).
xvla — soft-prompted Florence-2-based cross-embodiment VLA
From 2toINF. Bespoke xvla-adamw optimizer with VLM
at 1/10 LR
(policies/xvla/modeling_xvla.py:270-296).
rtc — Real-Time Chunking (NOT a policy)
Inference-time wrapper for flow-matching policies (π₀ / π₀.₅ / SmolVLA). Paper: Black, Galliker, Levine 2025 (policies/rtc/modeling_rtc.py:17-23).
pi_gemma shared backbone (NOT a registered policy)
A 363-line module factored out of pi0/ and pi05/.
Provides PiGemmaRMSNorm (AdaRMS), _gated_residual,
a decoder-layer factory, PiGemmaModel,
PiGemmaForCausalLM, and
PaliGemmaForConditionalGenerationWithPiGemma
(src/lerobot/policies/pi_gemma.py:54-352).
The factory does not route
--policy.type=pi_gemma — you can't train it directly. It
exists so π₀ / π₀.₅ / π₀-FAST drop a private-symbol
dependency on transformers internals.
Show the SARM reward-weighted-BC pipeline
A complete reward-weighted-BC pipeline ships in both versions, hidden behind
cfg.use_rabc:
- Annotate the dataset offline with a SARM model → sarm_progress.parquet per episode (stage class + within-stage τ).
- At train time, cfg.use_rabc=True instantiates RABCWeights(progress_path, chunk_size, head_mode, kappa, epsilon, device) (scripts/lerobot_train.py:300-318).
- The train loop calls policy.forward(batch, reduction="none"); multiplies the per-sample loss by w_i (from the parquet); reduces as (per_sample_loss * w).sum() / (w.sum() + 1e-6) (scripts/lerobot_train.py:96-117).
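In plain PyTorch the reduction in that last step is just (a sketch; the per-sample weights appear here as a precomputed tensor rather than an RABCWeights lookup):

```python
import torch

def rabc_reduce(per_sample_loss: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Reward-weighted BC reduction matching the formula above.

    per_sample_loss: (B,) from policy.forward(batch, reduction="none")
    w:               (B,) per-sample weights derived from sarm_progress.parquet
    """
    return (per_sample_loss * w).sum() / (w.sum() + 1e-6)
```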
Five fields in TrainPipelineConfig:
use_rabc, rabc_progress_path,
rabc_kappa=0.01, rabc_epsilon=1e-6,
rabc_head_mode="sparse"
(configs/train.py:73-78). Paper:
arXiv:2509.25358.
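Wired together at the CLI, enabling RA-BC looks like this (a sketch using the fields above; paths and repo id are placeholders):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --use_rabc=true \
  --rabc_progress_path=<path-to>/sarm_progress.parquet
```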
examples/dataset/slurm_compute_rabc.py
490-line offline annotator built on
datatrove.executor.slurm.SlurmPipelineExecutor. Two
subcommands:
- compute — N workers each compute SARM stage/progress (sparse and/or dense head) for a subset of episodes, writing per-rank parquet shards under rabc_shards/ (examples/dataset/slurm_compute_rabc.py:46-100+).
- aggregate — a single worker merges the shards into sarm_progress.parquet, optionally pushing to the Hub (examples/dataset/slurm_compute_rabc.py:33-36).
The only "SLURM" in the repo is for SARM annotation. Policy training itself never runs under SLURM.