Training a policy
7 min read · for SO-101 users
When you run lerobot-train, LeRobot loads your dataset, builds a
fresh policy network (or resumes one), and runs a standard PyTorch training
loop. The 0.5.0 release added one knob (cudnn_deterministic) and
silently changed image normalization for π0/π0.5 — your ACT and
Diffusion code is unchanged.
If you train ACT or Diffusion on your SO-101 demos, the upgrade changes nothing you'll notice. If you're fine-tuning π0 or π0.5, you may need to re-validate your weights, because the image preprocessing changed.
Training is teaching by example. The policy watches thousands of recorded demos and learns to predict "what motor command would the human have sent here?"
LeRobotDataset → DataLoader → Policy.forward(obs) → Loss → backprop → Optimizer.step()
The basics
What does lerobot-train actually do?
Five steps, in order: load the config from your CLI flags → load the
dataset (parquet + mp4 on disk, or pulled from the Hub) → build a fresh
policy network from scratch (or resume one from a checkpoint) → loop over
mini-batches calling policy.forward(batch), backprop, and
optimizer step → eval and checkpoint every N steps.
It's a vanilla PyTorch training loop wrapped by HuggingFace's
Accelerator. There is nothing exotic happening — if you've
trained any HF model before, this is the same shape.
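In outline the loop has this shape (a schematic, not LeRobot's actual code; the real loop, update_policy, is walked through in the deep-dive below):

```python
def train_loop(policy, dataloader, optimizer, lr_scheduler, accelerator, eval_freq=1000):
    """Schematic shape of lerobot-train's inner loop (heavily simplified)."""
    for step, batch in enumerate(dataloader):
        loss = policy.forward(batch)   # each policy computes its own loss
        accelerator.backward(loss)     # Accelerate handles AMP / device placement
        optimizer.step()
        optimizer.zero_grad()
        lr_scheduler.step()            # stepped once per batch
        if step % eval_freq == 0:
            ...                        # eval + checkpoint every N steps
```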
Picking a policy: ACT vs Diffusion vs SmolVLA vs π0/π0.5
You pass --policy.type=<name> at the CLI. The plain-language guide:
- ACT — small, fast transformer that predicts a chunk of future actions. Start here for SO-101 — 99% of small-data home setups want this.
- Diffusion Policy — slightly better when your demos show more than one valid way to do the task (multimodal behaviour). Same training time ballpark as ACT.
- SmolVLA — the smallest vision-language-action model. Language-conditioned ("pick the red cube"). Single-GPU friendly. Use when you want one policy that handles several tasks.
- π0 / π0.5 — large vision-language-action models from Physical Intelligence. More compute, best zero-shot generalization. Fine-tune from the Hub checkpoint, don't train from scratch.
- VQ-BeT, TD-MPC, SAC, GR00T, WALL-X, XVLA — niche or research-stage. SAC and TD-MPC are RL (skip unless you've read the RL page); GR00T wraps NVIDIA's external model.
What's a config?
Every CLI flag corresponds to a field on a Python dataclass. The configs nest:
a top-level TrainPipelineConfig contains a
DatasetConfig, a PreTrainedConfig (the policy), an
OptimizerConfig, an LRSchedulerConfig, and so on.
You override any nested field with dotted CLI flags
(--dataset.repo_id=..., --policy.type=act).
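For example, one run can override fields at several nesting depths (the repo id is a placeholder; the optimizer field name is an assumption for illustration, so check the dataclasses for the real field names):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --optimizer.lr=1e-4   # assumed field name, shown to illustrate dotted overrides
```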
Resume vs from-scratch
From-scratch is the default: pass --policy.type=act and a fresh
network is initialized. To resume an interrupted run, point at the output
folder with --config_path=outputs/your_run --resume=true —
LeRobot reloads the optimizer state, scheduler state, and step counter.
To fine-tune from a Hub checkpoint instead, use
--policy.pretrained_path=<hub-id> — weights load, but
training restarts at step 0.
How you actually use it
Train ACT on your dataset:
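A minimal run might look like this (repo id and output directory are placeholders):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --output_dir=outputs/act_so101
```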
Resume after a crash or pause:
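Point --config_path at the run's output folder, exactly as described above:

```bash
lerobot-train \
  --config_path=outputs/your_run \
  --resume=true
```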
Fine-tune SmolVLA from a Hub checkpoint:
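A sketch, assuming the base checkpoint lerobot/smolvla_base (substitute whichever Hub id you're starting from):

```bash
lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --dataset.repo_id=<your-hf-name>/so101-demos
```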
Things to know
The stock trainer is single-process with one Accelerator, and there is no
gradient-accumulation option, so effective batch size is hard-capped at
batch_size × num_processes. To reproduce the original Diffusion Policy paper
numbers you need an exponential moving average (EMA) of the weights; LeRobot
doesn't ship one, so add it externally if paper parity matters.
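If paper parity does matter, a minimal external EMA is easy to bolt on (a sketch, not a LeRobot API; it shadows parameters only, so also copy buffers if your policy depends on them):

```python
import copy
import torch

class EMA:
    """Minimal exponential moving average of model weights (external helper, not LeRobot)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # frozen shadow copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# After every optimizer.step(): ema.update(policy)
# Evaluate and checkpoint ema.shadow instead of the live policy.
```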
In v0.5.0, image-pad value flipped from -1.0 to
0.0 for π0 and π0.5, shifting normalization range from
[−1, 1] to [0, 1]. Re-validate any π0/π0.5 weights
you fine-tuned on v0.4.4 before trusting them on v0.5.0. ACT and
Diffusion are unaffected.
If you're fine-tuning a big VLA (π0, SmolVLA, WALL-X) on a single GPU,
pass --peft.method_type=LORA. There's no other memory-saving
path in the trainer — no FSDP, no DeepSpeed, no tensor parallelism.
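For example, a single-GPU π0 fine-tune with LoRA just combines flags already shown on this page (a sketch; <hub-id> stays a placeholder):

```bash
lerobot-train \
  --policy.type=pi0 \
  --policy.pretrained_path=<hub-id> \
  --peft.method_type=LORA \
  --dataset.repo_id=<your-hf-name>/so101-demos
```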
v0.5.0 ships examples/dataset/slurm_compute_rabc.py. It's
not a multi-node training launcher — it's an offline
annotator that runs the SARM reward model over your dataset. The trainer
itself is single-process. (More on SARM in the deep-dive below.)
Show the actual train loop with file:line refs
Entry point: scripts/lerobot_train.py (555
lines in v0.5.0, 551 in v0.4.4). main() calls
register_third_party_plugins() then
train(), wrapped by @parser.wrap() to feed it a
TrainPipelineConfig
(scripts/lerobot_train.py:549-551,
:153-154).
- Init / Accelerator. A single Accelerator with step_scheduler_with_optimizer=False and DistributedDataParallelKwargs(find_unused_parameters=True) (scripts/lerobot_train.py:177-187). torch.backends.cuda.matmul.allow_tf32 = True unconditionally; cudnn.deterministic is config-driven in 0.5.0.
- Dataset / env / policy. make_dataset(cfg) rank-0-first (:220-228); make_env(...) rank-0-only (:233-236); make_policy(cfg=cfg.policy, ds_meta=dataset.meta, rename_map=cfg.rename_map) (:240-244); optional policy.wrap_with_peft(...) (:246-250).
- Pre / post-processors. make_pre_post_processors(...) (:255-291). preprocessor_overrides["normalizer_processor"]["stats"] = dataset.meta.stats forces fresh dataset stats even when resuming (:266-284).
- Optim + scheduler. make_optimizer_and_scheduler (:295); RA-BC weights provider when cfg.use_rabc (:300-318).
- Sampler / dataloader. Policies exposing drop_n_last_frames (Diffusion etc.) get an EpisodeAwareSampler over per-episode index ranges; otherwise plain shuffle (:346-368). accelerator.prepare(...) wraps policy/optimizer/dataloader/scheduler.
- Train loop (update_policy, scripts/lerobot_train.py:59-150): optional RA-BC per-sample weighting (:96-117); accelerator.backward(loss); clip_grad_norm_ when grad_clip_norm > 0 (:122-130); optimizer.step() under an optional lock for SAC's separate optimizers (:133-134); lr_scheduler.step() per batch (:136-140); optional unwrapped policy.update() (TD-MPC's Polyak target update, :143-144).
- Eval / save. save_checkpoint(...) writes policy + optimizer + scheduler + processors; update_last_checkpoint() updates the symlink (:457-475); eval_policy_all over dict[suite][task_id] -> vec_env (:477-524); on finish, cfg.policy.push_to_hub pushes the unwrapped model + processors (:535-542).
Hooks present: resume; AMP via
accelerator.autocast(); per-sample weighting (RA-BC); gradient
clipping; lock-protected optimizer step (SAC).
Absent: explicit gradient accumulation; EMA weights; FSDP;
DeepSpeed; layer-wise depth-decay LR; checkpoint averaging or SWA.
cudnn_deterministic
One new field: cudnn_deterministic: bool = False
(configs/train.py:53-55). Wired in
scripts/lerobot_train.py:212-216: v0.4.4
unconditionally sets cudnn.benchmark = True; v0.5.0 picks
deterministic-or-benchmark based on the flag. Docstring says it "may
reduce training speed by ~10-20%".
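Schematically the wiring amounts to this (paraphrased from the behaviour described above, not a verbatim copy of scripts/lerobot_train.py:212-216):

```python
import torch

def configure_backends(cudnn_deterministic: bool) -> None:
    torch.backends.cuda.matmul.allow_tf32 = True   # unconditional in both versions
    if cudnn_deterministic:                        # new flag, default False
        torch.backends.cudnn.deterministic = True  # reproducible; ~10-20% slower per the docstring
        torch.backends.cudnn.benchmark = False
    else:
        torch.backends.cudnn.benchmark = True      # v0.4.4's unconditional behaviour
```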
Show all 14 native policies with architecture details
All policies subclass PreTrainedPolicy; each declares a
name string used by get_policy_class()
(policies/factory.py:60-138) and a
config_class. Each policy ships
configuration_<x>.py, modeling_<x>.py,
processor_<x>.py, and a README.
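That means resolving a policy class by its name string is a one-liner (a sketch; the import path assumes the src/lerobot layout mentioned later on this page):

```python
from lerobot.policies.factory import get_policy_class

policy_cls = get_policy_class("act")   # name string -> policy class
config_cls = policy_cls.config_class   # the paired configuration dataclass
```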
act — Action Chunking Transformer
DETR-style encoder/decoder over a CNN backbone, optional CVAE latent head,
optional online temporal ensembling. Defaults
chunk_size=100, n_action_steps=100,
n_obs_steps=1
(policies/act/configuration_act.py:85-87).
Two-group optimizer preset (backbone vs head)
(policies/act/modeling_act.py:71-90). Paper:
Zhao et al. arXiv:2304.13705.
diffusion — Diffusion Policy (DDPM/DDIM)
UNet1D over an action chunk conditioned on encoded image+state context, via
diffusers. EpisodeAwareSampler drops the last drop_n_last_frames
frames of each episode to avoid horizon bleed. Paper: Chi et al.,
IJRR 2024
(policies/diffusion/modeling_diffusion.py:17-21).
vqbet — Behavior Transformer over Residual-VQ-VAE
Two-stage: n_vqvae_training_steps=20000 of constant-LR RVQ
pre-training (16 codes, embedding 256), then transformer predicts code
indices via Focal loss + offset head
(policies/vqbet/configuration_vqbet.py:106-130).
Paper: Lee et al. arXiv:2403.03181.
tdmpc — Latent-dynamics MPC + TD learning
Online RL; update() is called every step for Polyak target
updates. Paper: Hansen, Wang, Su arXiv:2203.04955; FOWM additions
arXiv:2310.16029
(policies/tdmpc/modeling_tdmpc.py:511 for
update()).
sac — Soft Actor-Critic
Optional discrete gripper head, optionally shared image encoders. Returns
per-component param groups (actor, critic,
temperature, optional discrete_critic) for
MultiAdamConfig
(policies/sac/modeling_sac.py:39-72).
pi0 — π₀ port from OpenPI
PaliGemma VLM + Gemma action expert, conditional flow matching over a 50-step action chunk. Tokenizer length 48 (policies/pi0/configuration_pi0.py:32-97). Paper: arXiv:2410.24164.
pi0_fast — π₀ with FAST discrete action tokenization
DCT-based, decoded via scipy.fftpack.idct. Causal-LM head over
PaliGemma; loads HF tokenizer for action detokenization
(policies/pi0_fast/modeling_pi0_fast.py:32-44).
pi05 — π₀.₅ port
Differs from π₀: AdaRMS time conditioning via
time_mlp_*; tokenizer length 200; discrete state input (no
state_proj). Paper: arXiv:2504.16054
(policies/pi05/README.md:35-41).
smolvla — HF's affordable VLA
Pretrained VLM + flow-matching action expert. Paper: arXiv:2506.01844 (policies/smolvla/modeling_smolvla.py:17-50).
groot — wrapper over NVIDIA's Isaac-GR00T
GR00TN15.from_pretrained is the model loader; predict_action
via model.get_action
(policies/groot/modeling_groot.py:46-93). Paper:
arXiv:2503.14734 (GR00T N1).
sarm — Stage-Aware Reward Model (NOT a manipulation policy)
StageTransformer for stage class,
SubtaskTransformer for within-stage progress. Output feeds RA-BC
weighting. Paper: arXiv:2509.25358
(policies/sarm/modeling_sarm.py:18-25).
wall_x — WALL-OSS port
Qwen2.5-VL with MoE action experts; flow-matching or FAST-token action heads. PEFT/LoRA. Paper: arXiv:2509.11766 (policies/wall_x/modeling_wall_x.py:17-89).
xvla — soft-prompted Florence-2-based cross-embodiment VLA
From 2toINF. Bespoke xvla-adamw optimizer with VLM
at 1/10 LR
(policies/xvla/modeling_xvla.py:270-296).
rtc — Real-Time Chunking (NOT a policy)
Inference-time wrapper for flow-matching policies (π₀ / π₀.₅ / SmolVLA). Paper: Black, Galliker, Levine 2025 (policies/rtc/modeling_rtc.py:17-23).
pi_gemma shared backbone (NOT a registered policy)
A 363-line module factored out of pi0/ and pi05/.
Provides PiGemmaRMSNorm (AdaRMS), _gated_residual,
a decoder-layer factory, PiGemmaModel,
PiGemmaForCausalLM, and
PaliGemmaForConditionalGenerationWithPiGemma
(src/lerobot/policies/pi_gemma.py:54-352).
The factory does not route
--policy.type=pi_gemma — you can't train it directly. It
exists so π₀ / π₀.₅ / π₀-FAST drop a private-symbol
dependency on transformers internals.
Show the SARM reward-weighted-BC pipeline
A complete reward-weighted-BC pipeline ships in both versions, hidden behind
cfg.use_rabc:
- Annotate the dataset offline with a SARM model → sarm_progress.parquet per episode (stage class + within-stage τ).
- At train time, cfg.use_rabc=True instantiates RABCWeights(progress_path, chunk_size, head_mode, kappa, epsilon, device) (scripts/lerobot_train.py:300-318).
- The train loop calls policy.forward(batch, reduction="none"); multiplies the per-sample loss by w_i (from the parquet); reduces as (per_sample_loss * w).sum() / (w.sum() + 1e-6) (scripts/lerobot_train.py:96-117).
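In plain PyTorch the reduction in that last step is just (a sketch; the per-sample weights appear here as a precomputed tensor rather than an RABCWeights lookup):

```python
import torch

def rabc_reduce(per_sample_loss: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Reward-weighted BC reduction matching the formula above.

    per_sample_loss: (B,) from policy.forward(batch, reduction="none")
    w:               (B,) per-sample weights derived from sarm_progress.parquet
    """
    return (per_sample_loss * w).sum() / (w.sum() + 1e-6)
```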
Five fields in TrainPipelineConfig:
use_rabc, rabc_progress_path,
rabc_kappa=0.01, rabc_epsilon=1e-6,
rabc_head_mode="sparse"
(configs/train.py:73-78). Paper:
arXiv:2509.25358.
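Wired together at the CLI, enabling RA-BC looks like this (a sketch using the fields above; paths and repo id are placeholders):

```bash
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=<your-hf-name>/so101-demos \
  --use_rabc=true \
  --rabc_progress_path=<path-to>/sarm_progress.parquet
```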
examples/dataset/slurm_compute_rabc.py
490-line offline annotator built on
datatrove.executor.slurm.SlurmPipelineExecutor. Two
subcommands:
- compute — N workers each compute SARM stage/progress (sparse and/or dense head) for a subset of episodes, writing per-rank parquet shards under rabc_shards/ (examples/dataset/slurm_compute_rabc.py:46-100+).
- aggregate — a single worker merges the shards into sarm_progress.parquet, optionally pushing to the Hub (examples/dataset/slurm_compute_rabc.py:33-36).
The only "SLURM" in the repo is for SARM annotation. Policy training itself never runs under SLURM.