Can I add RL to my policy?
7 min read · for SO-101 users with a working IL policy
Not really, no. LeRobot ships a complete HIL-SERL setup (a smart way to train SAC with human help), but you can't load your trained ACT or Diffusion weights into it as a starting point — the SAC actor always begins from random. The π0.6-style "warm-start with IL, fine-tune with RL" pipeline isn't shipped yet.
If your IL policy plateaus at 80% success and you wanted to layer RL on top to push it higher: that path doesn't exist out-of-the-box. You can use HIL-SERL from scratch (with your demos in the replay buffer), or fork ~200 lines of code to build the bridge yourself.
Two mental pictures, side by side — what you wanted, and what's actually in the box.
What you wanted (pi0.6-style):

```
trained ACT weights ---> SAC actor (warm-start) ---> RL fine-tune
                                                          |
                                                          v
                                                    better policy
```

What LeRobot ships (HIL-SERL):

```
random SAC actor ---> rollout on robot ---+
                                          |
                                   human intervenes
                                          |
demos pre-loaded in offline buffer -------+--> SAC training
                                                    |
                                                    v
                                     good policy (but from scratch)
```
The basics
What is HIL-SERL?
Human-In-the-Loop Sample-Efficient RL. The training algorithm is SAC (Soft Actor-Critic, a continuous-control RL method); "sample-efficient" comes from three tricks bolted on: a reward classifier (a small net that decides whether each rollout looked like success), demo replay (your teleop demonstrations get pre-loaded into an offline buffer SAC samples from), and human interventions — you sit at the gamepad and grab control when the robot looks like it's about to fail. Those interventions go straight into the replay buffer.
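If the moving parts are hard to picture, here is a toy, self-contained sketch of that data flow. Nothing in it is LeRobot's API; every name (sac_actor, reward_classifier, human_intervenes, the buffers) is illustrative.

```python
import random

# Toy, self-contained sketch of the HIL-SERL data flow. Nothing here is
# LeRobot's API; every name is illustrative. It shows where the three tricks
# plug into one rollout loop: a reward classifier labels outcomes, teleop
# demos pre-seed the offline buffer, and human interventions land in both
# buffers.

def reward_classifier(obs):          # stand-in for the learned success detector
    return 1.0 if obs > 0.9 else 0.0

def sac_actor(obs):                  # stand-in for the from-scratch SAC actor
    return random.uniform(-1.0, 1.0)

def human_intervenes(step):          # stand-in for "operator grabbed the gamepad"
    return step % 10 == 0

demo_transitions = [(0.0, 1.0, 1.0, 0.1)] * 50    # teleop demos recorded earlier
online_buffer = []                                 # autonomous + intervention data
offline_buffer = list(demo_transitions)            # demos pre-loaded before training

obs = 0.0
for step in range(100):
    action = sac_actor(obs)
    intervened = human_intervenes(step)
    if intervened:
        action = 1.0                               # the human's corrective action
    next_obs = max(0.0, min(1.0, obs + 0.1 * action))
    reward = reward_classifier(next_obs)           # "did this look like success?"
    transition = (obs, action, reward, next_obs)
    online_buffer.append(transition)
    if intervened:
        offline_buffer.append(transition)          # interventions also become demos
    obs = next_obs
```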
Why doesn't it warm-start from my IL policy?
SAC predicts one action at a time. ACT, Diffusion, π0 and SmolVLA all predict chunks (8-100 actions per forward pass). The output shapes don't match. There's no code to copy weights from a chunk-predicting network into a single-action one — and no config knob saying "start the SAC actor from this checkpoint." The only "pretrained" hook in the SAC stack is the visual backbone, not the action head.
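A toy illustration of the mismatch, with tensor shapes only and no LeRobot classes:

```python
import torch

# Toy shapes only, not LeRobot classes. A chunking policy emits a whole
# segment of actions per forward pass; the SAC actor emits exactly one.
batch, chunk_size, action_dim = 4, 50, 6

act_like_output = torch.randn(batch, chunk_size, action_dim)  # [B, chunk, action_dim]
sac_like_output = torch.randn(batch, action_dim)              # [B, action_dim]
print(act_like_output.shape, sac_like_output.shape)

# The chunking head produces chunk_size * action_dim numbers per step (however
# it is organized internally); SAC's actor head produces action_dim means plus
# log-stds. The parameter shapes don't line up, so there is nothing to copy
# without first deciding how to collapse or unroll the chunk.
```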
What is QC-FQL?
Q-Chunked Flow Q-Learning. The README labels it "coming soon". The idea is RL that works on action chunks (matching what IL policies output) so you could actually warm-start from ACT or Diffusion weights. As of v0.5.0 there is no code, no PR — just the README mention.
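For intuition, here is a minimal sketch of the "Q over a chunk" half of that idea only (nothing about the flow-matching policy); ChunkedQ is a made-up name, and nothing like it ships in LeRobot.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "Q over a chunk" idea. ChunkedQ is a made-up name; no
# such class ships in LeRobot. The point is that the critic's input matches
# what ACT/Diffusion-style policies emit: a whole action chunk.
class ChunkedQ(nn.Module):
    def __init__(self, obs_dim=32, action_dim=6, chunk_size=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + chunk_size * action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, action_chunk):               # obs: [B, obs_dim]
        flat = action_chunk.flatten(start_dim=1)         # [B, chunk * action_dim]
        return self.net(torch.cat([obs, flat], dim=-1))  # one Q-value per chunk

q = ChunkedQ()
obs = torch.randn(2, 32)
chunk = torch.randn(2, 8, 6)     # the kind of output an IL policy produces
print(q(obs, chunk).shape)       # torch.Size([2, 1])
```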
What's actually in src/lerobot/rl/?
Eleven files, 4,629 lines in total: SAC, replay buffers (online + offline), a gRPC actor-learner split (GPU box trains while the robot box rolls out), the reward classifier, and the gamepad-intervention plumbing. A solid implementation; it just always starts from scratch.
How you actually use it
The standard HIL-SERL recipe is three things in sequence: record teleop demonstrations (these pre-load the offline buffer), train the reward classifier on success/failure frames, then launch the learner and actor processes and sit with the gamepad or leader arm to intervene while SAC trains online.
Things to know
- The full real-robot path is documented (docs/source/hilserl.mdx is 923 lines), tests cover the gRPC transport end-to-end, and the intervention path is wired into the leader arm and gamepad. If from-scratch SAC with demos and interventions is what you want, this is the cleanest implementation around.
- The default vision encoder helper2424/resnet10 doesn't load cleanly under the transformers version that v0.5.0 pins. Stay on v0.4.4 for HIL-SERL until that's patched, or swap the encoder.
- There is no il_warm_start_path or BC-loss config. You'd write the bridge yourself; the spec is in the deep-dive below (about 200 LoC for the residual-SAC variant).
- The trick that makes π0.6 work, a critic that scores action chunks instead of single actions, is not in the box. QC-FQL is forward-looking, not shipped.
Bottom line: not yet. Wait for QC-FQL, or fork the residual-SAC bridge.
Show why SAC can't warm-start from ACT (file evidence)
Three concrete reasons, in code.
(1) The SAC actor is random-init with no IL-bridge config. SACConfig (src/lerobot/policies/sac/configuration_sac.py, 243 lines) defines device, vision encoder name (line 127, e.g. "helper2424/resnet10"), freeze_vision_encoder (line 129), discount, entropy, replay-buffer capacity, and actor/critic learning rates. There is no field for "warm-start from IL checkpoint", "BC loss weight", or "policy-prior KL". The only "pretrained" hook is the visual backbone, _load_pretrained_vision_encoder (policies/sac/modeling_sac.py:936-942). The actor MLP and critic ensemble heads remain random.
make_policy (src/lerobot/policies/factory.py:487-491) honors cfg.pretrained_path only by calling policy_cls.from_pretrained(...). policy_cls is whatever cfg.policy.type resolves to, which for an RL run is "sac". So pretrained_path resumes an SAC checkpoint; it cannot inject ACT/Diffusion weights. Confirmed by zero references to ACTPolicy or DiffusionPolicy in src/lerobot/rl/.
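To make the dispatch concrete, here is a simplified, self-contained paraphrase; ToySAC, REGISTRY, and make_policy_sketch are stand-ins, not the actual factory source.

```python
# Simplified, self-contained paraphrase of the dispatch; ToySAC, REGISTRY, and
# make_policy_sketch are stand-ins, not the actual factory source. The point:
# pretrained_path always round-trips through the class that cfg.policy.type
# names, so an RL run can only resume an SAC checkpoint.

class ToySAC:
    def __init__(self):
        self.origin = "random init"

    @classmethod
    def from_pretrained(cls, path):
        policy = cls()
        policy.origin = f"SAC checkpoint at {path}"
        return policy

REGISTRY = {"sac": ToySAC}               # ACT/Diffusion never appear on this path

def make_policy_sketch(policy_type, pretrained_path=None):
    policy_cls = REGISTRY[policy_type]   # "sac" for an RL run
    if pretrained_path is not None:
        return policy_cls.from_pretrained(pretrained_path)  # SAC -> SAC only
    return policy_cls()                  # fresh random weights

print(make_policy_sketch("sac").origin)                     # random init
print(make_policy_sketch("sac", "outputs/sac_run").origin)  # resumes SAC only
```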
(2) SAC explicitly rejects action chunks. policies/sac/modeling_sac.py:78-81 is a stub:
```python
raise NotImplementedError(
    "SACPolicy does not support action chunking. It returns single actions!"
)
```
The actor outputs a single action per select_action call (policies/sac/modeling_sac.py:84-98). SACConfig.action_delta_indices returns None with the comment "SAC typically predicts one action at a time" (policies/sac/configuration_sac.py:237-239). No Q-chunking, full stop.
(3) The README says it explicitly. README.md:106:
Reinforcement Learning | HIL-SERL, TDMPC & QC-FQL (coming soon)
grep -rni "q.chunk|qchunk|fql" across the entire tree returns one hit: that very README announcement. There is no Q-chunking, FQL, or chunked-Q file anywhere in v0.5.0.
And in the HIL-SERL example: examples/tutorial/rl/hilserl_example.py:270 reads policy_actor = SACPolicy(policy_cfg), with no from_pretrained. The actor still starts random.
Show what would need to change to enable pi0.6 (concrete spec)
The load-bearing missing piece is a bridge config + factory hook that takes an IL checkpoint (ACT, Diffusion, π0, SmolVLA), copies matching layers into the SAC actor, optionally freezes the trunk, and reinitializes the critic from scratch. Without that, "warm-start with IL" is impossible regardless of whether Q-chunking is added later.
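A minimal sketch of that bridge's core move, with everything hypothetical (copy_matching_layers and the toy modules below do not exist in v0.5.0):

```python
import torch
import torch.nn as nn

# Minimal sketch of the bridge's core move, under loud assumptions:
# copy_matching_layers and the toy modules below are hypothetical; nothing
# like them ships in v0.5.0. The idea: copy IL parameters wherever name and
# shape agree, and leave everything else (critic included) at random init.

def copy_matching_layers(il_state: dict, rl_actor: nn.Module) -> list:
    """Copy IL parameters into the RL actor where name and shape both match."""
    rl_state = rl_actor.state_dict()
    copied = []
    for name, tensor in il_state.items():
        if name in rl_state and rl_state[name].shape == tensor.shape:
            rl_state[name] = tensor.clone()
            copied.append(name)
    rl_actor.load_state_dict(rl_state)
    return copied

# Toy stand-ins: a shared trunk whose weights transfer, heads that don't.
il_policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6 * 8))  # chunked head
rl_actor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6))       # single-action head

copied = copy_matching_layers(il_policy.state_dict(), rl_actor)
print(copied)   # only the first Linear transfers; the mismatched head stays random
```

In the toy above only the shared trunk transfers; the chunked head never will, which is why the spec below starts with an action-chunked actor and critic.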
Concrete file-level changes on top of v0.5.0:
- Action-chunked SAC actor. Replace SACActor in policies/sac/modeling_sac.py so forward returns [B, H, action_dim] instead of [B, action_dim]. Implement predict_action_chunk instead of raising NotImplementedError at lines 79-81. Add chunk_size: int = 8 to SACConfig.
- Chunked critic. Adapt CriticEnsemble (in modeling_sac.py) to take action chunks as input, either concatenated or run through an RNN over the chunk. Add a chunk-level Bellman backup in forward(batch, model="critic") (policies/sac/modeling_sac.py:139+).
- Bridge config: load IL weights into SAC actor. Extend SACConfig with il_warm_start_path: Path | None, il_policy_type: str | None (e.g. "act", "diffusion", "pi05"), and freeze_il_layers: list[str]. In make_policy (policies/factory.py:405+), after constructing SACPolicy, if cfg.il_warm_start_path is set, load the IL policy via make_policy(IL_cfg), copy matching layers into policy.actor, and reinitialize the critic from scratch (see the weight-copy sketch above).
- Auxiliary BC loss. Add bc_loss_weight: float = 0.0 to SACConfig; in SACPolicy.forward(model="actor"), when sampling the offline buffer (the demo half), add bc_weight * MSE(actor(s), a_demo) to the actor loss. This keeps the warm-started actor from drifting before the critic catches up.
- Optional residual head (smaller fork). Add a residual MLP that outputs Δa added to the IL prior's action; freeze the IL trunk, train only the residual + critic with SAC. This is a smaller, lower-risk variant than a full fine-tune, about 200 LoC, dropped into rl/actor.py:285-298 around the policy.select_action call (see the sketch after this list).
- SO-101 plumbing. The reward classifier and gRPC actor-learner split work as-is on SO-101. Just verify HILSerlProcessorConfig has SO-101-specific ee_action_bounds (the doc references SO-100 at docs/source/hilserl.mdx:407).
Note on fidelity: the HIL-SERL paper (Luo et al. 2024) doesn't warm-start the SAC actor from a pretrained IL policy either, so LeRobot's code is a faithful implementation of the paper, not an omission. If you want IL warm-start, you're leaving the paper's recipe and chasing the π0.6 idea instead.
Show the full HIL-SERL recipe with file refs
Eleven files, ~4,629 lines under src/lerobot/rl/: the complete HIL-SERL stack.
| File | Role | Lines |
|---|---|---|
| actor.py | Actor process: rollouts on real robot, sends transitions, applies updates from learner | 738 |
| learner.py | Learner process: SAC gradient updates, gRPC server, replay buffer mgmt | 1200 |
| buffer.py | ReplayBuffer (online + offline, DrQ-augmented), from_lerobot_dataset constructor | 834 |
| gym_manipulator.py | make_robot_env, RobotEnv Gym wrapper, reward classifier integration | 790 |
| learner_service.py | gRPC service definitions | 117 |
| process.py, queue.py, wandb_utils.py | Multiproc plumbing + logging | 338 |
| crop_dataset_roi.py, eval_policy.py, joint_observations_processor.py | Tooling | 612 |
The HIL-SERL mechanism in v0.5.0 (from examples/tutorial/rl/hilserl_example.py:51-97):

- online_buffer: every actor transition (autonomous + intervention) is added (line 57).
- offline_buffer: only intervention transitions are added (lines 60-65); it is also pre-populated from a LeRobotDataset via ReplayBuffer.from_lerobot_dataset (line 279, rl/buffer.py:415).
- Each SAC update samples batch_size//2 from each buffer and concatenates (lines 73-84). This is the HIL-SERL "demos + interventions" trick (toy illustration after this list).
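A toy illustration of that 50/50 sampling rule, with plain Python lists standing in for the ReplayBuffer API:

```python
import random

# Toy illustration of the 50/50 sampling rule; plain Python lists stand in
# for the ReplayBuffer API. Every gradient step sees demo/intervention data.
batch_size = 8
online_buffer = [f"online_{i}" for i in range(100)]   # autonomous + interventions
offline_buffer = [f"demo_{i}" for i in range(50)]     # demos + interventions only

online_half = random.sample(online_buffer, batch_size // 2)
offline_half = random.sample(offline_buffer, batch_size // 2)
batch = online_half + offline_half                    # concatenated SAC batch
print(batch)
```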
Components on disk:

- SAC policy + critic ensemble: policies/sac/modeling_sac.py (1064 lines).
- Reward classifier: policies/sac/reward_model/modeling_classifier.py (308 lines), a pretrained vision encoder plus classifier head that returns a scalar reward. Wired into gym_manipulator.py as RewardClassifierConfig.pretrained_path (docs/source/hilserl.mdx:121, 731). A schematic sketch follows this list.
- Async actor/learner split via gRPC (rl/actor.py:254, rl/learner.py:309). The learner streams updated weights to the actor every policy_parameters_push_frequency=4 SAC steps (policies/sac/configuration_sac.py:54).
- Intervention path: InterventionActionProcessorStep; gamepad/leader-arm interventions are tagged via complementary_info["is_intervention"] (examples/tutorial/rl/hilserl_example.py:60; docs/source/hilserl.mdx:155-166).
- Comprehensive doc: docs/source/hilserl.mdx (923 lines) walks through action-space cropping, leader-arm teleop intervention, EE-space training, ROI image crop, classifier training with lerobot-train --config_path …reward_classifier_train_config.json (line 718), and launching learner + actor (lines 815-825). Sim variant: docs/source/hilserl_sim.mdx (154 lines).
- Install via the [hilserl] extra; it pulls gym-hil>=0.1.13 from HF.
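For orientation, here is a schematic of what the reward classifier contributes at rollout time; the toy encoder and head are stand-ins for the real pretrained-backbone model.

```python
import torch
import torch.nn as nn

# Schematic of what the reward classifier contributes at rollout time; the toy
# encoder and head below are stand-ins for the real pretrained-backbone model.
# It maps the current camera frame to a success probability, which the env
# wrapper turns into a sparse 0/1 reward for SAC.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
head = nn.Linear(128, 1)

frame = torch.rand(1, 3, 64, 64)                    # one camera image
success_prob = torch.sigmoid(head(encoder(frame)))  # P(frame looks like success)
reward = (success_prob > 0.5).float()               # sparse reward fed to SAC
print(success_prob.item(), reward.item())
```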
Tests confirm what's officially shipped — and what isn't:
- tests/rl/test_actor_learner.py:91-300+ covers the gRPC transport (test_end_to_end_transitions_flow, test_end_to_end_interactions_flow, test_end_to_end_parameters_flow). All three instantiate SACConfig() from defaults; no IL warm-start is tested.
- tests/policies/test_sac_policy.py:82-540 covers the forward pass, the multi-critic ensemble, the save/load round-trip (SACPolicy.from_pretrained(root, config=config) round-trips SAC→SAC), and the pretrained vision encoder (currently @pytest.mark.skip at line 308).
- tests/policies/hilserl/test_modeling_classifier.py covers reward classifier behavior.
No test for IL→RL warm-start because there is no such code path.
The RL surface is functionally identical between v0.4.4 and v0.5.0: diff -rq returns one trivial diff in reward_model/configuration_classifier.py and zero diffs elsewhere.