Running a trained policy
6 min read · for SO-101 users
Once you've trained a policy, you have two ways to run it on your robot:
a simple synchronous one (lerobot-eval) and a fast async one
(a separate inference server talking to the robot over the network). The
async path is the one to use for 30 Hz real-time on big VLAs.
If your robot moves slowly or you're doing sim eval, lerobot-eval
is fine. If you're trying to drive an SO-101 at 30 Hz with a Diffusion or
π0 policy, the policy itself can take ~100 ms per inference — you
need the async path so the robot can keep moving while inference happens
in parallel.
Synchronous = walk-stop-think-walk. One program. The robot waits while the policy thinks. Async = the robot keeps moving while a separate computer thinks ahead. Two programs talking over the network.
```
Synchronous (lerobot-eval):

  +-robot-+   obs    +-policy-+  action   +-robot-+
  |       | -------> | runs   | --------> | moves |
  | wait  |          | ~100ms |           |       |
  +-------+          +--------+           +-------+

Async (RobotClient + PolicyServer):

  +-Robot host-+                    +-GPU host-+
  | RobotClient| <---- gRPC ------> | Server   |
  | executes   |                    | runs     |
  | chunks     |                    | policy   |
  +------------+                    +----------+
```
The basics
What is a policy doing at inference time?
A policy is the trained brain. At inference time it does one job: look at what the robot sees right now (camera images + joint positions) and output what the robot should do next. Training is over — just forward passes.
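Concretely, that's a checkpoint load plus a forward call. A minimal sketch, assuming an ACT checkpoint; the repo id, feature keys, and shapes below are placeholders that depend on your training config:

```python
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

policy = ACTPolicy.from_pretrained("your-name/act_so101_pick")  # placeholder id
policy.eval()

# One observation: feature names and shapes depend on your training config.
batch = {
    "observation.state": torch.zeros(1, 6),                   # 6 SO-101 joints
    "observation.images.front": torch.zeros(1, 3, 480, 640),  # camera key varies
}

with torch.no_grad():
    action = policy.select_action(batch)  # (1, action_dim): the next motor command
```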
Why action chunks?
Most modern policies don't predict one action at a time. They predict a whole chunk of future actions in a single forward pass — ACT and Diffusion typically chunk 8–100 steps; π0 / π0.5 / SmolVLA do the same. The robot replays the chunk one step at a time while the policy thinks about the next chunk in parallel. This is the trick that makes 30 Hz feasible even when one inference call takes ~100 ms.
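The arithmetic behind that claim, as a runnable back-of-envelope check (chunk size and latency here are illustrative, not fixed defaults):

```python
FPS = 30          # control rate
CHUNK = 50        # actions per forward pass (policy/config dependent)
INFER_S = 0.100   # one inference call

motion_per_chunk_s = CHUNK / FPS        # ~1.67 s of robot motion per chunk
print(INFER_S < motion_per_chunk_s)     # True: the next chunk is ready well
                                        # before the current one runs out
```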
Synchronous vs async
Synchronous means one program does observation →
inference → action in a single loop, waiting at each step.
lerobot-eval works this way and it's perfect for sim
benchmarks or low-fps tasks. Async splits the work across
two processes (usually two machines) connected by a network channel
— the robot side keeps issuing actions on schedule, the policy side
keeps producing new chunks just-in-time.
What is RTC?
RTC (Real-Time Chunking) is not a policy. It's a flow-matching trick that smooths the seam between consecutive chunks when inference latency is uneven — only relevant for π0 / π0.5 / SmolVLA. You enable it via a config flag; you don't pick "RTC" as a policy type.
Where do I run inference?
For ACT on an SO-101, run inference on the same machine as the robot. For big VLAs (π0, π0.5, SmolVLA), run inference on a separate GPU box and let the async stack ferry observations and actions across the network.
How you actually use it
The simple path — a sim-eval rollout on a trained policy:
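(A sketch in the draccus flag style of recent releases; confirm exact flags with lerobot-eval --help.)

```bash
lerobot-eval \
  --policy.path=your-name/diffusion_pusht \
  --env.type=pusht \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```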
The async path — two terminals, one per machine:
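(Module paths follow this page's source references; the flag names are assumptions, so check each entry point's --help before copying.)

```bash
# Terminal 1 (GPU host): serve the policy
python -m lerobot.async_inference.policy_server --host=0.0.0.0 --port=8080

# Terminal 2 (robot host): run the client against it
python -m lerobot.async_inference.robot_client \
  --server_address=<gpu-host>:8080 \
  --robot.type=so101_follower \
  --policy_type=pi0 \
  --pretrained_name_or_path=your-name/pi0_so101 \
  --chunk_size_threshold=0.5
```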
Things to know
- lerobot-record: its rollout loop is synchronous — you'll eat full inference latency on every queue refill and miss your control deadline. Fine for low-fps tasks; broken for 30 Hz on Diffusion / π0 / SmolVLA.
- The async stack speaks gRPC (a network protocol for program-to-program calls) over plain TCP — no encryption, no authentication. Co-locate the client and server on the same trusted network.
- The client requests new actions when it has half a chunk left. That's a sensible default but tune it to your task — a slower policy needs the request to fire earlier.
- ACT supports temporal ensembling. Predict a chunk every step, blend the overlapping predictions. Useful at 30 Hz if your inference budget allows the every-step call.
- For pi-family policies, RTC is the cleaner fix for chunk-boundary jitter than temporal ensembling.
- The async stack is functionally unchanged 0.4.4 → 0.5.0 — just import shuffles. Whatever you build today keeps working.
Show the gRPC architecture (services, RobotClient, PolicyServer)
The async stack is a two-process gRPC system: a RobotClient
on the robot host and a PolicyServer on a (typically GPU)
inference host. Wire format is gRPC over plain TCP, payloads are pickled
Python objects.
Four RPCs in src/lerobot/transport/services.proto:38-45:

- SendObservations(stream Observation) → Empty
- GetActions(Empty) → Actions
- SendPolicyInstructions(PolicySetup) → Empty
- Ready(Empty) → Empty
Observations stream because they may exceed gRPC's 4 MB message limit
(src/lerobot/transport/services.proto:73).
Payloads are pickled into services_pb2.Observation chunks
(src/lerobot/async_inference/robot_client.py:196-202;
helper transport.utils.send_bytes_in_chunks).
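What a client-side call reduces to, as a sketch. The stub and message names come from the references above; the reply's data field is an assumption:

```python
import pickle

import grpc
from lerobot.transport import services_pb2, services_pb2_grpc

channel = grpc.insecure_channel("gpu-host:8080")      # plain TCP, no TLS/auth
stub = services_pb2_grpc.AsyncInferenceStub(channel)

stub.Ready(services_pb2.Empty())                      # handshake
reply = stub.GetActions(services_pb2.Empty())         # blocks until a chunk exists
chunk = pickle.loads(reply.data)                      # payload field name is an assumption
```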
Client init at
async_inference/robot_client.py:110-114
opens an insecure gRPC channel with
grpc_channel_options(initial_backoff=f"{config.environment_dt:.4f}s")
and constructs AsyncInferenceStub. Threading on the client
uses a threading.Barrier(2) (line 128) between an
action-receiver thread
(async_inference/robot_client.py:269,
pulls action chunks via stub.GetActions(Empty)) and the main
control loop
(async_inference/robot_client.py:458) which
performs actions when available and sends fresh observations whenever
queue occupancy drops below chunk_size_threshold
(default 0.5 per
async_inference/configs.py:137).
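The client's threading, reduced to its shape; every lowercase helper below is a placeholder rather than a lerobot API:

```python
import threading
from collections import deque

action_queue: deque = deque()
barrier = threading.Barrier(2)        # releases both threads in lockstep
ACTIONS_PER_CHUNK = 50                # illustrative
CHUNK_SIZE_THRESHOLD = 0.5            # default per async_inference/configs.py:137

def action_receiver():
    barrier.wait()
    while True:
        chunk = pull_chunk_from_server()     # stub.GetActions(...), unpickled
        action_queue.extend(chunk)

def control_loop():
    barrier.wait()
    while True:
        if action_queue:
            send_to_robot(action_queue.popleft())
        if len(action_queue) < CHUNK_SIZE_THRESHOLD * ACTIONS_PER_CHUNK:
            stream_observation(capture_observation())   # may trigger inference
        wait_for_next_tick()                 # paces the loop at 1/fps
```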
Server-side,
async_inference/policy_server.py:216-260
(GetActions) blocks on a Queue(maxsize=1),
runs _predict_action_chunk
(async_inference/policy_server.py:332-407)
through full pre/postprocessor pipelines, slices the chunk to
actions_per_chunk, pickles, and returns. The server also
deliberately time.sleeps to a target
inference_latency
(async_inference/policy_server.py:256-258;
default 1/fps = 0.033 s at
async_inference/constants.py:20) to avoid
flooding the client with chunks faster than it consumes them.
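The server loop in miniature, under the page's description; predict_action_chunk here is a placeholder for the real pipeline call:

```python
import pickle
import time
from queue import Queue

obs_queue: Queue = Queue(maxsize=1)   # only the freshest observation waits
ACTIONS_PER_CHUNK = 50                # config value; 50 is illustrative
INFERENCE_LATENCY = 1 / 30            # default target per constants.py

def get_actions_handler():
    start = time.perf_counter()
    obs = obs_queue.get()                      # block until an observation arrives
    chunk = predict_action_chunk(obs)          # placeholder: full pre/postprocessing
    chunk = chunk[:, :ACTIONS_PER_CHUNK]       # slice to the configured length
    remaining = INFERENCE_LATENCY - (time.perf_counter() - start)
    if remaining > 0:
        time.sleep(remaining)                  # pace replies; don't flood the client
    return pickle.dumps(chunk)
```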
When the client's queue empties, it tags the next observation
must_go=True
(async_inference/robot_client.py:136, 427-437),
forcing the server to enqueue and run inference even if the obs is
similar to the previous one. Default observation deduplication is an
L2-norm-in-joint-space filter at atol=1
(async_inference/helpers.py:280-297).
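A sketch of that filter's logic, assuming the atol and must_go semantics described above; function names are mine, not lerobot's:

```python
import torch

def is_duplicate(joints: torch.Tensor, prev_joints: torch.Tensor,
                 atol: float = 1.0) -> bool:
    """L2 distance in joint space below atol counts as 'same observation'."""
    return torch.linalg.norm(joints - prev_joints).item() < atol

def should_send(joints, prev_joints, must_go: bool) -> bool:
    """must_go overrides dedup when the client's queue has emptied."""
    return must_go or not is_duplicate(joints, prev_joints)
```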
Show how action chunks are dispatched per policy class
All chunked policies expose
predict_action_chunk(batch) → (B, chunk_size, action_dim)
plus a select_action(batch) → (B, action_dim) wrapper
that maintains an internal
_action_queue = collections.deque(maxlen=n_action_steps)
and re-plans only when the queue empties.
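The shared pattern, reduced to a sketch (schematic: real policy classes fold this into select_action itself, along with normalization and per-policy details):

```python
from collections import deque

import torch

class ChunkedPolicyWrapper:
    def __init__(self, policy, n_action_steps: int):
        self.policy = policy
        self._action_queue = deque(maxlen=n_action_steps)

    @torch.no_grad()
    def select_action(self, batch):
        if len(self._action_queue) == 0:                      # re-plan only when empty
            chunk = self.policy.predict_action_chunk(batch)   # (B, chunk, dim)
            # queue the first n_action_steps, transposed to iterate over time
            self._action_queue.extend(chunk.transpose(0, 1))
        return self._action_queue.popleft()                   # (B, action_dim)
```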
ACT — deque or temporal ensembler.
src/lerobot/policies/act/modeling_act.py:92-122.
reset() either initializes a temporal ensembler (if
temporal_ensemble_coeff is not None) or allocates the
action deque (line 97). With ensembling, every select_action
calls predict_action_chunk and feeds the full chunk to
ACTTemporalEnsembler.update (lines 109-112). Otherwise the
deque is filled once and popped one at a time (lines 116-122).
ACTTemporalEnsembler implements Algorithm 2 from the ACT
paper with exponential weights w_i = exp(-coeff · i);
default coeff=0.01 weights older actions more heavily.
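To make the weighting concrete, here is a tiny sketch of the blend with the default coefficient; the four-prediction count is arbitrary:

```python
import torch

coeff = 0.01
i = torch.arange(4)            # four overlapping predictions for one timestep
w = torch.exp(-coeff * i)      # tensor([1.0000, 0.9900, 0.9802, 0.9704])

def blend(preds: torch.Tensor) -> torch.Tensor:
    """Weighted average over the prediction axis (oldest first)."""
    return (w[:, None] * preds).sum(dim=0) / w.sum()
```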
Diffusion — replan-every-N.
src/lerobot/policies/diffusion/modeling_diffusion.py:82-139.
Observation deques + an action deque of maxlen=n_action_steps.
Re-plan triggers on len(_queues[ACTION]) == 0 (line 134).
No temporal ensembling.
π0 / π0.5 / SmolVLA use the same deque pattern as
Diffusion
(src/lerobot/policies/pi05/modeling_pi05.py:1119-1228,
src/lerobot/policies/pi0/modeling_pi0.py:1144-1257,
src/lerobot/policies/smolvla/modeling_smolvla.py:251-347).
select_action raises if RTC is enabled.
predict_action_chunk accepts
**kwargs: Unpack[ActionSelectKwargs] so RTC's
inference_delay / prev_chunk_left_over flow
through.
SAC is the odd one out:
src/lerobot/policies/sac/modeling_sac.py:79-81
explicitly raises
NotImplementedError("SACPolicy does not support action chunking. It returns single actions!").
Relevant for the RL discussion — see the
RL-on-top page.
Show what RTC actually does (flow-matching prefix anchoring)
src/lerobot/policies/rtc/README.md:1-6 is unambiguous: RTC is an inference enhancement for flow-matching policies (π0, π0.5, SmolVLA), not a policy itself. It implements Black, Galliker, Levine 2025 ("Real-Time Execution of Action Chunking Flow Policies", arXiv:2506.07339).
Three modules:

- RTCConfig (src/lerobot/policies/rtc/configuration_rtc.py:30-55): enabled, prefix_attention_schedule ∈ {ZEROS, ONES, LINEAR, EXP} (default LINEAR), max_guidance_weight = 10.0, execution_horizon = 10.
- RTCProcessor (src/lerobot/policies/rtc/modeling_rtc.py:37) wraps a flow-matching denoise step. The core method, denoise_step(x_t, prev_chunk_left_over, inference_delay, time, ...) (src/lerobot/policies/rtc/modeling_rtc.py:116-248), reverses time (tau = 1 - time, line 164), pads the previous-chunk leftover to the current chunk shape (lines 195-198), builds prefix weights (lines 204-209), and applies an autograd-based correction: x1_t = x_t - time*v_t, err = (prev_chunk_left_over - x1_t) * weights, correction = autograd.grad(x1_t, x_t, err) (lines 215-218), with corrected velocity v_t - guidance_weight * correction (line 228). The guidance weight is clamp((1-tau)/tau · ((1-tau)² + tau²)/(1-tau)², max_guidance_weight) (lines 220-226). This is the "inpainting" formulation: the prefix anchors the new chunk to the leftover unexecuted prefix of the previous one, smoothing the handoff (a sketch follows this list).
- ActionQueue (src/lerobot/policies/rtc/action_queue.py:35-219): a thread-safe queue holding two tensors — original_queue (raw actions, used to compute the next chunk's prev_chunk_left_over) and queue (post-processed actions, fed to the robot).
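The correction step in isolation, as a standalone sketch rather than the lerobot API; the denoiser callable, shapes, and the restriction of tau to the open interval (0, 1) are assumptions:

```python
import torch

def rtc_denoise_step(denoiser, x_t, tau, prev_leftover, weights,
                     max_guidance_weight=10.0):
    """One guidance-corrected denoise step (sketch).

    x_t: (B, chunk, dim) noisy chunk; denoiser(x_t, time) -> velocity v_t;
    prev_leftover: previous chunk padded to (B, chunk, dim);
    weights: prefix weights broadcastable to x_t; tau in (0, 1).
    """
    time = 1.0 - tau                          # mirrors the doc's tau = 1 - time
    x_t = x_t.detach().requires_grad_(True)
    v_t = denoiser(x_t, time)                 # velocity depends on x_t
    x1_t = x_t - time * v_t                   # implied clean-chunk estimate
    err = (prev_leftover - x1_t) * weights    # weighted prefix disagreement
    # Vector-Jacobian product: pulls the weighted error back through the model.
    (correction,) = torch.autograd.grad(x1_t, x_t, grad_outputs=err)
    guidance = min(
        (1 - tau) / tau * ((1 - tau) ** 2 + tau ** 2) / (1 - tau) ** 2,
        max_guidance_weight,
    )
    return (v_t - guidance * correction).detach()
```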
In pi0 / pi05 / smolvla, select_action raises if RTC is
enabled (comment "RTC is not supported for select_action, use it with
predict_action_chunk" at
src/lerobot/policies/pi05/modeling_pi05.py:1216-1217,
src/lerobot/policies/pi0/modeling_pi0.py:1245-1246,
src/lerobot/policies/smolvla/modeling_smolvla.py:332-333).
RTC must be driven via
predict_action_chunk(batch, inference_delay=…, prev_chunk_left_over=…)
from a manual rollout loop — the
examples/rtc/eval_with_real_robot.py two-thread pattern is
the supported template (a get_actions daemon refills the
queue while an actor_control daemon consumes one action per
1/fps tick).
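A condensed sketch of that pattern, simplified from the shape described above: policy, get_obs(), and send_action() are placeholders, using the still-queued actions as prev_chunk_left_over approximates what ActionQueue tracks on raw actions, and the inference_delay estimate (last call's duration in control steps) is a simplification:

```python
import threading
import time
from collections import deque

import torch

FPS = 30
CHUNK_SIZE = 50                  # illustrative; use your policy's config value
queue: deque = deque()
lock = threading.Lock()
last_infer_s = 0.0               # duration of the previous inference call

def get_actions():
    """Daemon: refill the queue whenever it runs low."""
    global last_infer_s
    while True:
        with lock:
            leftover = torch.stack(list(queue)) if queue else None
            need = len(queue) < CHUNK_SIZE // 2
        if need:
            obs = get_obs()                                   # placeholder
            start = time.perf_counter()
            chunk = policy.predict_action_chunk(              # placeholder policy
                obs,
                inference_delay=int(last_infer_s * FPS),      # steps consumed meanwhile
                prev_chunk_left_over=leftover,
            )
            last_infer_s = time.perf_counter() - start
            with lock:
                queue.extend(chunk[0])                        # (chunk, dim) rows
        time.sleep(1 / FPS)

def actor_control():
    """Daemon: consume exactly one action per 1/FPS tick."""
    while True:
        tick = time.perf_counter()
        with lock:
            action = queue.popleft() if queue else None
        if action is not None:
            send_action(action)                               # placeholder
        time.sleep(max(0.0, 1 / FPS - (time.perf_counter() - tick)))

threading.Thread(target=get_actions, daemon=True).start()
threading.Thread(target=actor_control, daemon=True).start()
```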