Inference

Running a trained policy

6 min read · for SO-101 users

Once you've trained a policy, you have two ways to run it: a simple synchronous one (lerobot-eval in sim, or lerobot-record with a policy on the real robot) and a fast async one (a separate inference server talking to the robot over the network). The async path is the one to use for 30 Hz real-time on big VLAs.

If your robot moves slowly or you're doing sim eval, the synchronous path is fine. If you're trying to drive an SO-101 at 30 Hz with a Diffusion or π0 policy, the policy itself can take ~100 ms per inference; you need the async path so the robot can keep moving while inference happens in parallel.

Synchronous = walk-stop-think-walk. One program. The robot waits while the policy thinks. Async = the robot keeps moving while a separate computer thinks ahead. Two programs talking over the network.

Synchronous (lerobot-eval):
  +-robot-+   obs    +-policy-+   action   +-robot-+
  |       | -------> | runs   | ---------> | moves |
  | wait  |          | ~100ms |            |       |
  +-------+          +--------+            +-------+

Async (RobotClient + PolicyServer):
  +-Robot host-+                    +-GPU host-+
  | RobotClient| <--- gRPC -------> | Server   |
  | executes   |                    | runs     |
  | chunks     |                    | policy   |
  +------------+                    +----------+

The basics

What is a policy doing at inference time?

A policy is the trained brain. At inference time it does one job: look at what the robot sees right now (camera images + joint positions) and output what the robot should do next. Training is over — just forward passes.

Why action chunks?

Most modern policies don't predict one action at a time. They predict a whole chunk of future actions in a single forward pass — ACT and Diffusion typically chunk 8–100 steps; π0 / π0.5 / SmolVLA do the same. The robot replays the chunk one step at a time while the policy thinks about the next chunk in parallel. This is the trick that makes 30 Hz feasible even when one inference call takes ~100 ms.
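
A quick back-of-the-envelope check makes the point; the numbers below are illustrative, not measured:

# Illustrative numbers, not measurements
fps = 30             # control rate the robot needs
inference_s = 0.100  # one forward pass of a big policy
chunk_size = 50      # actions predicted per forward pass

chunk_duration_s = chunk_size / fps
print(f"one chunk covers {chunk_duration_s:.2f} s of motion")  # 1.67 s
print(f"computing the next chunk takes ~{inference_s:.2f} s")  # 0.10 s
# As long as the next chunk is requested before the current one runs out,
# the robot never has to stop and wait for the policy.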

Synchronous vs async

Synchronous means one program does observation → inference → action in a single loop, waiting at each step. lerobot-eval works this way and it's perfect for sim benchmarks or low-fps tasks. Async splits the work across two processes (usually two machines) connected by a network channel — the robot side keeps issuing actions on schedule, the policy side keeps producing new chunks just-in-time.
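
As a mental model, the synchronous loop is just the following sketch; the method names are simplified stand-ins, not the actual lerobot-eval internals:

def synchronous_rollout(robot, policy, n_steps: int) -> None:
    # Sketch of the synchronous pattern. `robot` and `policy` are assumed to
    # expose get_observation() / send_action() and select_action().
    for _ in range(n_steps):
        obs = robot.get_observation()       # cameras + joint positions
        action = policy.select_action(obs)  # blocks for the full inference time
        robot.send_action(action)           # the robot only moves after inference returns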

What is RTC?

RTC (Real-Time Chunking) is not a policy. It's a flow-matching trick that smooths the seam between consecutive chunks when inference latency is uneven — only relevant for π0 / π0.5 / SmolVLA. You enable it via a config flag; you don't pick "RTC" as a policy type.

Where do I run inference?

For ACT on an SO-101, run inference on the same machine as the robot. For big VLAs (π0, π0.5, SmolVLA), run inference on a separate GPU box and let the async stack ferry observations and actions across the network.

How you actually use it

The simple path — a sim-eval rollout on a trained policy:

# Sim eval on your trained policy
lerobot-eval \
  --policy.path=outputs/act_run1 \
  --env.type=gym_pusht \
  --eval.n_episodes=20

The async path — two terminals, one per machine:

# Terminal 1 (GPU host): start the policy server
# Async server/client are run as Python modules, not console scripts
python -m lerobot.async_inference.policy_server --policy.path=outputs/diffusion_run1

# Terminal 2 (robot host): connect a robot client
python -m lerobot.async_inference.robot_client --robot.type=so101_follower --server.host=...

Things to know

Don't drive a real-time policy through lerobot-record

Its rollout loop is synchronous — you'll eat full inference latency on every queue refill and miss your control deadline. Fine for low-fps tasks; broken for 30 Hz on Diffusion / π0 / SmolVLA.

Async transport is plain TCP, no auth

The async stack speaks gRPC (a network protocol for program-to-program calls) over plain TCP — no encryption, no authentication. Co-locate the client and server on the same trusted network.

Default replan threshold is 0.5

The client requests new actions when it has half a chunk left. That's a sensible default but tune it to your task — a slower policy needs the request to fire earlier.
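
A quick sanity check on that default, with made-up but plausible numbers:

fps = 30             # control rate
chunk_size = 50      # actions the server returns per inference (example value)
threshold = 0.5      # chunk_size_threshold: request a new chunk at half occupancy
inference_s = 0.100  # policy forward pass
network_s = 0.020    # request/response round trip (illustrative)

time_left_s = threshold * chunk_size / fps          # 0.83 s of queued motion left
margin_s = time_left_s - (inference_s + network_s)  # 0.71 s of slack
print(f"{margin_s:.2f} s of slack before the queue runs dry")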

Three extras worth knowing
  • ACT supports temporal ensembling. Predict a chunk every step, blend the overlapping predictions. Useful at 30 Hz if your inference budget allows the every-step call.
  • For pi-family policies, RTC is the cleaner fix for chunk-boundary jitter than temporal ensembling.
  • The async stack is functionally unchanged 0.4.4 → 0.5.0 — just import shuffles. Whatever you build today keeps working.

Optional: under the hood

The gRPC architecture (services, RobotClient, PolicyServer)

The async stack is a two-process gRPC system: a RobotClient on the robot host and a PolicyServer on a (typically GPU) inference host. Wire format is gRPC over plain TCP, payloads are pickled Python objects.

Four RPCs in src/lerobot/transport/services.proto:38-45:

  • SendObservations(stream Observation) → Empty
  • GetActions(Empty) → Actions
  • SendPolicyInstructions(PolicySetup) → Empty
  • Ready(Empty) → Empty

Observations stream because they may exceed gRPC's 4 MB message limit (src/lerobot/transport/services.proto:73). Payloads are pickled into services_pb2.Observation chunks (src/lerobot/async_inference/robot_client.py:196-202; helper transport.utils.send_bytes_in_chunks).

Client init at async_inference/robot_client.py:110-114 opens an insecure gRPC channel with grpc_channel_options(initial_backoff=f"{config.environment_dt:.4f}s") and constructs an AsyncInferenceStub. Threading on the client uses a threading.Barrier(2) (line 128) to synchronize an action-receiver thread (async_inference/robot_client.py:269, which pulls action chunks via stub.GetActions(Empty)) with the main control loop (async_inference/robot_client.py:458), which executes actions as they become available and sends fresh observations whenever queue occupancy drops below chunk_size_threshold (default 0.5 per async_inference/configs.py:137).
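
Stripped of the gRPC plumbing, the client side boils down to the sketch below. The helper callables (get_chunk, get_observation, send_action, send_observation) are hypothetical stand-ins for the stub and robot I/O; this is not the RobotClient code.

import threading
import time
from queue import Queue

def run_client(get_chunk, get_observation, send_action, send_observation,
               fps: float = 30.0, chunk_size: int = 50, threshold: float = 0.5) -> None:
    """Sketch of the client pattern: a receiver thread keeps the action queue
    topped up while the control loop executes actions on a fixed schedule."""
    actions: Queue = Queue()

    def receiver() -> None:
        while True:
            for a in get_chunk():      # blocks until the server returns a chunk
                actions.put(a)

    threading.Thread(target=receiver, daemon=True).start()

    while True:
        if not actions.empty():
            send_action(actions.get())                # one action per control tick
        if actions.qsize() < threshold * chunk_size:  # queue below the replan threshold
            send_observation(get_observation())       # ask for a fresh chunk
        time.sleep(1 / fps)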

Server-side, async_inference/policy_server.py:216-260 (GetActions) blocks on a Queue(maxsize=1), runs _predict_action_chunk (async_inference/policy_server.py:332-407) through the full pre/postprocessor pipelines, slices the chunk to actions_per_chunk, pickles it, and returns. The server also deliberately time.sleeps up to a target inference_latency (async_inference/policy_server.py:256-258; default 1/fps = 0.033 s at async_inference/constants.py:20) so it doesn't flood the client with chunks faster than they can be consumed.
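
The server side has the same shape. A minimal sketch under the same caveat (made-up names, not the PolicyServer code):

import pickle
import time
from queue import Queue

def serve_chunks(policy, obs_queue: Queue, target_latency_s: float = 0.033):
    """Sketch of the server loop: block on the freshest observation, run one
    chunked forward pass, and pace replies so the client is not flooded."""
    while True:
        start = time.perf_counter()
        obs = obs_queue.get()                     # Queue(maxsize=1) holds only the latest obs
        chunk = policy.predict_action_chunk(obs)  # (B, chunk_size, action_dim)
        yield pickle.dumps(chunk)                 # payloads are pickled on the wire
        elapsed = time.perf_counter() - start
        if elapsed < target_latency_s:
            time.sleep(target_latency_s - elapsed)  # sleep up to the target latency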

When the client's queue empties, it tags the next observation must_go=True (async_inference/robot_client.py:136, 427-437), forcing the server to enqueue and run inference even if the obs is similar to the previous one. Default observation deduplication is an L2-norm-in-joint-space filter at atol=1 (async_inference/helpers.py:280-297).
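
The dedup filter is simple enough to sketch; this is a simplified stand-in for the real helper, using the atol=1 default from above:

import torch

def should_send(new_joints: torch.Tensor, prev_joints: torch.Tensor,
                must_go: bool, atol: float = 1.0) -> bool:
    """Skip observations whose joint positions barely moved since the last one,
    unless the client's queue has emptied and the observation is tagged must_go."""
    if must_go:
        return True
    return torch.linalg.vector_norm(new_joints - prev_joints).item() > atol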

How action chunks are dispatched per policy class

All chunked policies expose predict_action_chunk(batch) → (B, chunk_size, action_dim) plus a select_action(batch) → (B, action_dim) wrapper that maintains an internal _action_queue = collections.deque(maxlen=n_action_steps) and re-plans only when the queue empties.
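
A sketch of that shared pattern (a hypothetical wrapper, not code copied from any modeling file):

from collections import deque

import torch

class ChunkedPolicyWrapper:
    """Sketch of the shared select_action pattern: replan only when the
    internal queue is empty, otherwise pop the next queued action."""

    def __init__(self, policy, n_action_steps: int):
        self.policy = policy
        self.n_action_steps = n_action_steps
        self._action_queue: deque = deque(maxlen=n_action_steps)

    @torch.no_grad()
    def select_action(self, batch: dict) -> torch.Tensor:
        if len(self._action_queue) == 0:
            chunk = self.policy.predict_action_chunk(batch)  # (B, chunk_size, action_dim)
            # keep the first n_action_steps and queue them as per-step (B, action_dim) tensors
            self._action_queue.extend(chunk[:, : self.n_action_steps].transpose(0, 1))
        return self._action_queue.popleft()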

ACT — deque or temporal ensembler. src/lerobot/policies/act/modeling_act.py:92-122. reset() either initializes a temporal ensembler (if temporal_ensemble_coeff is not None) or allocates the action deque (line 97). With ensembling, every select_action calls predict_action_chunk and feeds the full chunk to ACTTemporalEnsembler.update (lines 109-112). Otherwise the deque is filled once and popped one at a time (lines 116-122). ACTTemporalEnsembler implements Algorithm 2 from the ACT paper with exponential weights w_i = exp(-coeff · i); default coeff=0.01 weights older actions more heavily.
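
The weighting is easy to eyeball; coeff and the number of overlapping predictions below are just example values:

import torch

coeff = 0.01                       # ACT's default temporal_ensemble_coeff
n_overlapping = 5                  # chunks that predicted the current timestep
i = torch.arange(n_overlapping)    # i = 0 is the oldest prediction
weights = torch.exp(-coeff * i)    # w_i = exp(-coeff * i): oldest weighted most
weights = weights / weights.sum()  # normalize before blending the overlapping actions
print(weights)                     # ~[0.204, 0.202, 0.200, 0.198, 0.196]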

Diffusion — replan-every-N. src/lerobot/policies/diffusion/modeling_diffusion.py:82-139. Observation deques + an action deque of maxlen=n_action_steps. Re-plan triggers on len(_queues[ACTION]) == 0 (line 134). No temporal ensembling.

π0 / π0.5 / SmolVLA use the same deque pattern as Diffusion (src/lerobot/policies/pi05/modeling_pi05.py:1119-1228, src/lerobot/policies/pi0/modeling_pi0.py:1144-1257, src/lerobot/policies/smolvla/modeling_smolvla.py:251-347). select_action raises if RTC is enabled. predict_action_chunk accepts **kwargs: Unpack[ActionSelectKwargs] so RTC's inference_delay / prev_chunk_left_over flow through.

SAC is the odd one out: src/lerobot/policies/sac/modeling_sac.py:79-81 explicitly raises NotImplementedError("SACPolicy does not support action chunking. It returns single actions!"). Relevant for the RL discussion — see the RL-on-top page.

What RTC actually does (flow-matching prefix anchoring)

src/lerobot/policies/rtc/README.md:1-6 is unambiguous: RTC is an inference enhancement for flow-matching policies (π0, π0.5, SmolVLA), not a policy itself. It implements Black, Galliker, Levine 2025 ("Real-Time Execution of Action Chunking Flow Policies", arXiv:2506.07339).

Three modules:

  • RTCConfig (src/lerobot/policies/rtc/configuration_rtc.py:30-55): enabled, prefix_attention_schedule ∈ {ZEROS, ONES, LINEAR, EXP} (default LINEAR), max_guidance_weight = 10.0, execution_horizon = 10.
  • RTCProcessor (src/lerobot/policies/rtc/modeling_rtc.py:37) wraps a flow-matching denoise step. The core method, denoise_step(x_t, prev_chunk_left_over, inference_delay, time, ...) (src/lerobot/policies/rtc/modeling_rtc.py:116-248), reverses time (tau = 1 - time, line 164), pads the previous-chunk leftover to the current chunk shape (lines 195-198), builds prefix weights (lines 204-209), and applies an autograd-based correction: x1_t = x_t - time*v_t, err = (prev_chunk_left_over - x1_t) * weights, correction = autograd.grad(x1_t, x_t, err) (lines 215-218), with corrected velocity v_t - guidance_weight * correction (line 228). Guidance weight is clamp((1-tau)/tau · ((1-tau)² + tau²)/(1-tau)², max_guidance_weight) (lines 220-226). The "inpainting" formulation: the prefix anchors the new chunk to the leftover unexecuted prefix, smoothing the handoff (see the sketch after this list).
  • ActionQueue (src/lerobot/policies/rtc/action_queue.py:35-219): thread-safe queue with two tensors — original_queue (raw, used to compute the next chunk's prev_chunk_left_over) and queue (post-processed, fed to the robot).
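
To make the correction concrete, here is a minimal sketch of the prefix-anchored correction and guidance weight as described above. The denoiser callable, tensor shapes, and function name are assumptions for illustration; this is not the RTCProcessor code.

import torch

def rtc_corrected_velocity(denoiser, x_t, prev_chunk_left_over, prefix_weights,
                           time: float, max_guidance_weight: float = 10.0):
    """Sketch of the RTC correction. `denoiser` is a hypothetical callable that
    returns the flow-matching velocity for a noisy action chunk; shapes are
    (chunk_size, action_dim). Assumes 0 < time < 1 so tau stays in (0, 1)."""
    tau = 1.0 - time                           # RTC reverses the time variable
    x_t = x_t.detach().requires_grad_(True)
    v_t = denoiser(x_t, time)                  # velocity prediction
    x1_t = x_t - time * v_t                    # predicted clean chunk
    err = (prev_chunk_left_over - x1_t) * prefix_weights  # weighted error on the anchored prefix
    correction = torch.autograd.grad(x1_t, x_t, grad_outputs=err)[0]
    # clamp((1 - tau)/tau * ((1 - tau)^2 + tau^2)/(1 - tau)^2, max_guidance_weight)
    guidance = min((1 - tau) / tau * ((1 - tau) ** 2 + tau ** 2) / (1 - tau) ** 2,
                   max_guidance_weight)
    return v_t - guidance * correction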

In pi0 / pi05 / smolvla, select_action raises if RTC is enabled (comment "RTC is not supported for select_action, use it with predict_action_chunk" at src/lerobot/policies/pi05/modeling_pi05.py:1216-1217, src/lerobot/policies/pi0/modeling_pi0.py:1245-1246, src/lerobot/policies/smolvla/modeling_smolvla.py:332-333). RTC must be driven via predict_action_chunk(batch, inference_delay=…, prev_chunk_left_over=…) from a manual rollout loop — the examples/rtc/eval_with_real_robot.py two-thread pattern is the supported template (a get_actions daemon refills the queue while an actor_control daemon consumes one action per 1/fps tick).

Where to go next →

You've got policies running on the robot. The next question most users ask is whether they can layer reinforcement learning on top to push success rates higher. Can I add RL to my policy? — the direct answer.