Can I add RL to my policy?
7 min read · for SO-101 users with a working IL policy
Not really, no. LeRobot ships a complete HIL-SERL setup (a smart way to train SAC with human help), but you can't load your trained ACT or Diffusion weights into it as a starting point — the SAC actor always begins from random. The π0.6-style "warm-start with IL, fine-tune with RL" pipeline isn't shipped yet.
If your IL policy plateaus at 80% success and you wanted to layer RL on top to push it higher: that path doesn't exist out-of-the-box. You can use HIL-SERL from scratch (with your demos in the replay buffer), or fork ~200 lines of code to build the bridge yourself.
Two mental pictures, side by side — what you wanted, and what's actually in the box.
What you wanted (pi0.6-style):

```
trained ACT weights ---> SAC actor (warm-start) ---> RL fine-tune
                                                          |
                                                          v
                                                    better policy
```

What LeRobot ships (HIL-SERL):

```
random SAC actor ---> rollout on robot ---+
                                          |
                                   human intervenes
                                          |
demos pre-loaded in offline buffer -------+--> SAC training
                                                    |
                                                    v
                                     good policy (but from scratch)
```
The basics
What is HIL-SERL?
Human-In-the-Loop Sample-Efficient RL. The training algorithm is SAC (Soft Actor-Critic, a continuous-control RL method); "sample-efficient" comes from three tricks bolted on: a reward classifier (a small net that decides whether each rollout looked like success), demo replay (your teleop demonstrations get pre-loaded into an offline buffer SAC samples from), and human interventions — you sit at the gamepad and grab control when the robot looks like it's about to fail. Those interventions go straight into the replay buffer.
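If the moving parts are hard to picture, here is a toy, self-contained sketch of that data flow. Nothing in it is LeRobot's API; every name (sac_actor, reward_classifier, human_intervenes, the buffers) is illustrative.

```python
import random

# Toy, self-contained sketch of the HIL-SERL data flow. Nothing here is
# LeRobot's API; every name is illustrative. It shows where the three tricks
# plug into one rollout loop: a reward classifier labels outcomes, teleop
# demos pre-seed the offline buffer, and human interventions land in both
# buffers.

def reward_classifier(obs):          # stand-in for the learned success detector
    return 1.0 if obs > 0.9 else 0.0

def sac_actor(obs):                  # stand-in for the from-scratch SAC actor
    return random.uniform(-1.0, 1.0)

def human_intervenes(step):          # stand-in for "operator grabbed the gamepad"
    return step % 10 == 0

demo_transitions = [(0.0, 1.0, 1.0, 0.1)] * 50    # teleop demos recorded earlier
online_buffer = []                                 # autonomous + intervention data
offline_buffer = list(demo_transitions)            # demos pre-loaded before training

obs = 0.0
for step in range(100):
    action = sac_actor(obs)
    intervened = human_intervenes(step)
    if intervened:
        action = 1.0                               # the human's corrective action
    next_obs = max(0.0, min(1.0, obs + 0.1 * action))
    reward = reward_classifier(next_obs)           # "did this look like success?"
    transition = (obs, action, reward, next_obs)
    online_buffer.append(transition)
    if intervened:
        offline_buffer.append(transition)          # interventions also become demos
    obs = next_obs
```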
Why doesn't it warm-start from my IL policy?
SAC predicts one action at a time. ACT, Diffusion, π0 and SmolVLA all predict chunks (8-100 actions per forward pass). The output shapes don't match. There's no code to copy weights from a chunk-predicting network into a single-action one — and no config knob saying "start the SAC actor from this checkpoint." The only "pretrained" hook in the SAC stack is the visual backbone, not the action head.
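A toy illustration of the mismatch, with tensor shapes only and no LeRobot classes:

```python
import torch

# Toy shapes only, not LeRobot classes. A chunking policy emits a whole
# segment of actions per forward pass; the SAC actor emits exactly one.
batch, chunk_size, action_dim = 4, 50, 6

act_like_output = torch.randn(batch, chunk_size, action_dim)  # [B, chunk, action_dim]
sac_like_output = torch.randn(batch, action_dim)              # [B, action_dim]
print(act_like_output.shape, sac_like_output.shape)

# The chunking head produces chunk_size * action_dim numbers per step (however
# it is organized internally); SAC's actor head produces action_dim means plus
# log-stds. The parameter shapes don't line up, so there is nothing to copy
# without first deciding how to collapse or unroll the chunk.
```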
What is QC-FQL?
Q-Chunked Flow Q-Learning. The README labels it "coming soon". The idea is RL that works on action chunks (matching what IL policies output) so you could actually warm-start from ACT or Diffusion weights. As of v0.5.0 there is no code, no PR — just the README mention.
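For intuition, here is a minimal sketch of the "Q over a chunk" half of that idea only (nothing about the flow-matching policy); ChunkedQ is a made-up name, and nothing like it ships in LeRobot.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "Q over a chunk" idea. ChunkedQ is a made-up name; no
# such class ships in LeRobot. The point is that the critic's input matches
# what ACT/Diffusion-style policies emit: a whole action chunk.
class ChunkedQ(nn.Module):
    def __init__(self, obs_dim=32, action_dim=6, chunk_size=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + chunk_size * action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, action_chunk):               # obs: [B, obs_dim]
        flat = action_chunk.flatten(start_dim=1)         # [B, chunk * action_dim]
        return self.net(torch.cat([obs, flat], dim=-1))  # one Q-value per chunk

q = ChunkedQ()
obs = torch.randn(2, 32)
chunk = torch.randn(2, 8, 6)     # the kind of output an IL policy produces
print(q(obs, chunk).shape)       # torch.Size([2, 1])
```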
What's actually in src/lerobot/rl/?
Eleven files, 4,629 lines in total: SAC, replay buffers (online + offline), a gRPC actor-learner split (GPU box trains while the robot box rolls out), the reward classifier, and the gamepad-intervention plumbing. A solid implementation; it just always starts from scratch.
How you actually use it
The standard HIL-SERL recipe is three things in sequence: record teleop demonstrations (these pre-load the offline buffer), train the reward classifier on success/failure frames, then launch the learner and actor processes and sit with the gamepad or leader arm to intervene while SAC trains online.
Things to know
- The full real-robot path is documented (docs/source/hilserl.mdx is 923 lines), tests cover the gRPC transport end-to-end, and the intervention path is wired into the leader arm and gamepad. If from-scratch SAC with demos and interventions is what you want, this is the cleanest implementation around.
- The default vision encoder helper2424/resnet10 doesn't load cleanly under the transformers version that v0.5.0 pins. Stay on v0.4.4 for HIL-SERL until that's patched, or swap the encoder.
- There is no il_warm_start_path or BC-loss config. You'd write the bridge yourself; the spec is in the deep-dive below (about 200 LoC for the residual-SAC variant).
- The trick that makes π0.6 work, a critic that scores action chunks instead of single actions, is not in the box. QC-FQL is forward-looking, not shipped.
Bottom line: not yet. Wait for QC-FQL, or fork the residual-SAC bridge.
Show why SAC can't warm-start from ACT (file evidence)
Three concrete reasons, in code.
(1) The SAC actor is random-init with no IL-bridge config. SACConfig (src/lerobot/policies/sac/configuration_sac.py, 243 lines) defines device, vision encoder name (line 127, e.g. "helper2424/resnet10"), freeze_vision_encoder (line 129), discount, entropy, replay-buffer capacity, and actor/critic learning rates. There is no field for "warm-start from IL checkpoint", "BC loss weight", or "policy-prior KL". The only "pretrained" hook is the visual backbone, _load_pretrained_vision_encoder (policies/sac/modeling_sac.py:936-942). The actor MLP and critic ensemble heads remain random.
make_policy (src/lerobot/policies/factory.py:487-491) honors cfg.pretrained_path only by calling policy_cls.from_pretrained(...). policy_cls is whatever cfg.policy.type resolves to, which for an RL run is "sac". So pretrained_path resumes an SAC checkpoint; it cannot inject ACT/Diffusion weights. Confirmed by zero references to ACTPolicy or DiffusionPolicy in src/lerobot/rl/.
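To make the dispatch concrete, here is a simplified, self-contained paraphrase; ToySAC, REGISTRY, and make_policy_sketch are stand-ins, not the actual factory source.

```python
# Simplified, self-contained paraphrase of the dispatch; ToySAC, REGISTRY, and
# make_policy_sketch are stand-ins, not the actual factory source. The point:
# pretrained_path always round-trips through the class that cfg.policy.type
# names, so an RL run can only resume an SAC checkpoint.

class ToySAC:
    def __init__(self):
        self.origin = "random init"

    @classmethod
    def from_pretrained(cls, path):
        policy = cls()
        policy.origin = f"SAC checkpoint at {path}"
        return policy

REGISTRY = {"sac": ToySAC}               # ACT/Diffusion never appear on this path

def make_policy_sketch(policy_type, pretrained_path=None):
    policy_cls = REGISTRY[policy_type]   # "sac" for an RL run
    if pretrained_path is not None:
        return policy_cls.from_pretrained(pretrained_path)  # SAC -> SAC only
    return policy_cls()                  # fresh random weights

print(make_policy_sketch("sac").origin)                     # random init
print(make_policy_sketch("sac", "outputs/sac_run").origin)  # resumes SAC only
```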
(2) SAC explicitly rejects action chunks. policies/sac/modeling_sac.py:78-81 is a stub:
```python
raise NotImplementedError(
    "SACPolicy does not support action chunking. It returns single actions!"
)
```
The actor outputs a single action per select_action call (policies/sac/modeling_sac.py:84-98). SACConfig.action_delta_indices returns None with the comment "SAC typically predicts one action at a time" (policies/sac/configuration_sac.py:237-239). No Q-chunking, full stop.
(3) The README says it explicitly. README.md:106:
Reinforcement Learning | HIL-SERL, TDMPC & QC-FQL (coming soon)
grep -rni "q.chunk|qchunk|fql" across the entire tree returns one hit: that very README announcement. There is no Q-chunking, FQL, or chunked-Q file anywhere in v0.5.0.
And in the HIL-SERL example: examples/tutorial/rl/hilserl_example.py:270 reads policy_actor = SACPolicy(policy_cfg), with no from_pretrained. The actor still starts random.
Show what would need to change to enable pi0.6 (concrete spec)
The load-bearing missing piece is a bridge config + factory hook that takes an IL checkpoint (ACT, Diffusion, π0, SmolVLA), copies matching layers into the SAC actor, optionally freezes the trunk, and reinitializes the critic from scratch. Without that, "warm-start with IL" is impossible regardless of whether Q-chunking is added later.
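A minimal sketch of that bridge's core move, with everything hypothetical (copy_matching_layers and the toy modules below do not exist in v0.5.0):

```python
import torch
import torch.nn as nn

# Minimal sketch of the bridge's core move, under loud assumptions:
# copy_matching_layers and the toy modules below are hypothetical; nothing
# like them ships in v0.5.0. The idea: copy IL parameters wherever name and
# shape agree, and leave everything else (critic included) at random init.

def copy_matching_layers(il_state: dict, rl_actor: nn.Module) -> list:
    """Copy IL parameters into the RL actor where name and shape both match."""
    rl_state = rl_actor.state_dict()
    copied = []
    for name, tensor in il_state.items():
        if name in rl_state and rl_state[name].shape == tensor.shape:
            rl_state[name] = tensor.clone()
            copied.append(name)
    rl_actor.load_state_dict(rl_state)
    return copied

# Toy stand-ins: a shared trunk whose weights transfer, heads that don't.
il_policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6 * 8))  # chunked head
rl_actor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6))       # single-action head

copied = copy_matching_layers(il_policy.state_dict(), rl_actor)
print(copied)   # only the first Linear transfers; the mismatched head stays random
```

In the toy above only the shared trunk transfers; the chunked head never will, which is why the spec below starts with an action-chunked actor and critic.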
Concrete file-level changes on top of v0.5.0:
- Action-chunked SAC actor. Replace SACActor in policies/sac/modeling_sac.py so forward returns [B, H, action_dim] instead of [B, action_dim]. Implement predict_action_chunk instead of raising NotImplementedError at lines 79-81. Add chunk_size: int = 8 to SACConfig.
- Chunked critic. Adapt CriticEnsemble (in modeling_sac.py) to take action chunks as input, either concatenated or run through an RNN over the chunk. Add a chunk-level Bellman backup in forward(batch, model="critic") (policies/sac/modeling_sac.py:139+).
- Bridge config: load IL weights into SAC actor. Extend SACConfig with il_warm_start_path: Path | None, il_policy_type: str | None (e.g. "act", "diffusion", "pi05"), and freeze_il_layers: list[str]. In make_policy (policies/factory.py:405+), after constructing SACPolicy, if cfg.il_warm_start_path is set, load the IL policy via make_policy(IL_cfg), copy matching layers into policy.actor, and reinitialize the critic from scratch (see the weight-copy sketch above).
- Auxiliary BC loss. Add bc_loss_weight: float = 0.0 to SACConfig; in SACPolicy.forward(model="actor"), when sampling the offline buffer (the demo half), add bc_weight * MSE(actor(s), a_demo) to the actor loss. This keeps the warm-started actor from drifting before the critic catches up.
- Optional residual head (smaller fork). Add a residual MLP that outputs Δa added to the IL prior's action; freeze the IL trunk, train only the residual + critic with SAC. This is a smaller, lower-risk variant than a full fine-tune, about 200 LoC, dropped into rl/actor.py:285-298 around the policy.select_action call (see the sketch after this list).
- SO-101 plumbing. The reward classifier and gRPC actor-learner split work as-is on SO-101. Just verify HILSerlProcessorConfig has SO-101-specific ee_action_bounds (the doc references SO-100 at docs/source/hilserl.mdx:407).
Note on fidelity: the HIL-SERL paper (Luo et al. 2024) doesn't warm-start the SAC actor from a pretrained IL policy either, so LeRobot's code is a faithful implementation of the paper, not an omission. If you want IL warm-start, you're leaving the paper's recipe and chasing the π0.6 idea instead.
Show the full HIL-SERL recipe with file refs
Eleven files, ~4,629 lines under src/lerobot/rl/: the complete HIL-SERL stack.
| File | Role | Lines |
|---|---|---|
| actor.py | Actor process: rollouts on real robot, sends transitions, applies updates from learner | 738 |
| learner.py | Learner process: SAC gradient updates, gRPC server, replay buffer mgmt | 1200 |
| buffer.py | ReplayBuffer (online + offline, DrQ-augmented), from_lerobot_dataset constructor | 834 |
| gym_manipulator.py | make_robot_env, RobotEnv Gym wrapper, reward classifier integration | 790 |
| learner_service.py | gRPC service definitions | 117 |
| process.py, queue.py, wandb_utils.py | Multiproc plumbing + logging | 338 |
| crop_dataset_roi.py, eval_policy.py, joint_observations_processor.py | Tooling | 612 |
The HIL-SERL mechanism in v0.5.0 (from examples/tutorial/rl/hilserl_example.py:51-97):

- online_buffer: every actor transition (autonomous + intervention) is added (line 57).
- offline_buffer: only intervention transitions are added (lines 60-65); it is also pre-populated from a LeRobotDataset via ReplayBuffer.from_lerobot_dataset (line 279, rl/buffer.py:415).
- Each SAC update samples batch_size//2 from each buffer and concatenates (lines 73-84). This is the HIL-SERL "demos + interventions" trick (toy illustration after this list).
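A toy illustration of that 50/50 sampling rule, with plain Python lists standing in for the ReplayBuffer API:

```python
import random

# Toy illustration of the 50/50 sampling rule; plain Python lists stand in
# for the ReplayBuffer API. Every gradient step sees demo/intervention data.
batch_size = 8
online_buffer = [f"online_{i}" for i in range(100)]   # autonomous + interventions
offline_buffer = [f"demo_{i}" for i in range(50)]     # demos + interventions only

online_half = random.sample(online_buffer, batch_size // 2)
offline_half = random.sample(offline_buffer, batch_size // 2)
batch = online_half + offline_half                    # concatenated SAC batch
print(batch)
```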
Components on disk:

- SAC policy + critic ensemble: policies/sac/modeling_sac.py (1064 lines).
- Reward classifier: policies/sac/reward_model/modeling_classifier.py (308 lines), a pretrained vision encoder plus classifier head that returns a scalar reward. Wired into gym_manipulator.py as RewardClassifierConfig.pretrained_path (docs/source/hilserl.mdx:121, 731). A schematic sketch follows this list.
- Async actor/learner split via gRPC (rl/actor.py:254, rl/learner.py:309). The learner streams updated weights to the actor every policy_parameters_push_frequency=4 SAC steps (policies/sac/configuration_sac.py:54).
- Intervention path: InterventionActionProcessorStep; gamepad/leader-arm interventions are tagged via complementary_info["is_intervention"] (examples/tutorial/rl/hilserl_example.py:60; docs/source/hilserl.mdx:155-166).
- Comprehensive doc: docs/source/hilserl.mdx (923 lines) walks through action-space cropping, leader-arm teleop intervention, EE-space training, ROI image crop, classifier training with lerobot-train --config_path …reward_classifier_train_config.json (line 718), and launching learner + actor (lines 815-825). Sim variant: docs/source/hilserl_sim.mdx (154 lines).
- Install via the [hilserl] extra; it pulls gym-hil>=0.1.13 from HF.
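For orientation, here is a schematic of what the reward classifier contributes at rollout time; the toy encoder and head are stand-ins for the real pretrained-backbone model.

```python
import torch
import torch.nn as nn

# Schematic of what the reward classifier contributes at rollout time; the toy
# encoder and head below are stand-ins for the real pretrained-backbone model.
# It maps the current camera frame to a success probability, which the env
# wrapper turns into a sparse 0/1 reward for SAC.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
head = nn.Linear(128, 1)

frame = torch.rand(1, 3, 64, 64)                    # one camera image
success_prob = torch.sigmoid(head(encoder(frame)))  # P(frame looks like success)
reward = (success_prob > 0.5).float()               # sparse reward fed to SAC
print(success_prob.item(), reward.item())
```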
Tests confirm what's officially shipped — and what isn't:
- tests/rl/test_actor_learner.py:91-300+ covers the gRPC transport (test_end_to_end_transitions_flow, test_end_to_end_interactions_flow, test_end_to_end_parameters_flow). All three instantiate SACConfig() from defaults; no IL warm-start is tested.
- tests/policies/test_sac_policy.py:82-540 covers the forward pass, the multi-critic ensemble, the save/load round-trip (SACPolicy.from_pretrained(root, config=config) round-trips SAC→SAC), and the pretrained vision encoder (currently @pytest.mark.skip at line 308).
- tests/policies/hilserl/test_modeling_classifier.py covers reward classifier behavior.
No test for IL→RL warm-start because there is no such code path.
The RL surface is functionally identical between v0.4.4 and v0.5.0: diff -rq returns one trivial diff in reward_model/configuration_classifier.py and zero diffs elsewhere.