What plugs in
For SO-101 users
What this means for you: if you want to fine-tune SmolVLA or GR00T on your SO-101 demos, that just works — one pip extra to install, one config flag, go. If you wanted to plug a V-JEPA visual encoder into ACT instead of its default ResNet, you'd be writing a fork.
SHIPPED IN v0.5.0:

┌──────────────┐  ┌─────────────┐  ┌─────────────┐
│   SmolVLA    │  │  GR00T VLA  │  │  π0 / π0.5  │
│  (450M, HF)  │  │  (NVIDIA)   │  │ (PaliGemma) │
└──────────────┘  └─────────────┘  └─────────────┘

┌──────────────┐  ┌─────────────┐
│  GR00T-WBC   │  │  Holosoma   │   (G1 humanoid only)
│ (locomotion) │  │ (locomotion)│
└──────────────┘  └─────────────┘

NOT IN THE CODE — fork-only:

▢ V-JEPA / DINO encoders
▢ UMI ingestion (Stanford handheld)
▢ Diffusion world models for rollout / planning
Top row is what an SO-101 user can actually use. Middle row is humanoid-only. The fork-only list at the bottom is the work you'd own if you wanted it.
The basics
What's a VLA?
A VLA is a Vision-Language-Action model: a big transformer that takes camera frames plus a text instruction (“put the red block in the bowl”) and outputs robot actions. SmolVLA, GR00T, and π0 are all VLAs — same family, different sizes and pre-training recipes.
What ships natively?
“Native” here means the code is in the repo, a pip install extra brings it in, and a config flag turns it on.
- SmolVLA (450M parameters, HuggingFace) — the friendliest VLA for fine-tuning on a single consumer GPU. Closest thing to a “weekend VLA” on SO-101. Pretrained checkpoint `lerobot/smolvla_base`.
- GR00T N1.5 (3B, NVIDIA) — manipulation only, more compute, stronger zero-shot. Pretrained checkpoint `nvidia/GR00T-N1.5-3B`. Wrapper delegates to NVIDIA's Isaac-GR00T runner.
- π0 / π0.5 / π0-FAST — the SOTA tier, large models, PaliGemma backbone. Pretrained `lerobot/pi05_base`.
- GR00T-WBC + Holosoma — locomotion ONNX policies for the Unitree G1 humanoid. Not relevant for arm-only setups.
What's missing?
V-JEPA, V-JEPA-2, DINOv2/DINOv3 (research-grade visual encoders), UMI (Stanford's handheld GoPro data-collection pipeline), and diffusion world models (learned-dynamics planners) are all absent from the codebase. A repo-wide grep for any of these terms returns zero hits in `src/`.
Sims
Four optional sim extras: gym-aloha (bimanual ALOHA), gym-pusht (Push-T benchmark), metaworld (manipulation suite), hf-libero (LIBERO, Linux only). None simulate SO-101. For SO-101 in sim you pull a community env from the HuggingFace hub — not an official one.
How you actually use it
Fine-tune SmolVLA on your demos — the most plug-and-play VLA path:
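A minimal sketch of that path. The `lerobot-train` entry point and `--policy.path` flag follow the repo's SmolVLA docs; the dataset id, batch size, and step count below are placeholders — check docs/source/smolvla.mdx for the exact invocation:

```bash
# One pip extra…
pip install -e ".[smolvla]"

# …one command to fine-tune the pretrained checkpoint on your demos
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=<hf_user>/<so101_demos> \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_so101
```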
Fine-tune GR00T N1.5 — bigger model, better zero-shot, more compute:
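Roughly the same shape (a sketch: `policy.type=groot` and `embodiment_tag=new_embodiment` are the documented knobs here; the remaining flag spellings are assumptions to verify against the GR00T docs page):

```bash
pip install -e ".[groot]"

lerobot-train \
  --policy.type=groot \
  --policy.embodiment_tag=new_embodiment \
  --dataset.repo_id=<hf_user>/<so101_demos> \
  --steps=20000 \
  --output_dir=outputs/train/groot_so101
```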
`embodiment_tag=new_embodiment` tells GR00T “this is a new arm shape, learn to map onto it.”
Things to know
Both ACT and Diffusion Policy hard-validate the vision backbone as one of the ResNet variants. Swapping in V-JEPA, DINO, or any other encoder is not a config flag — it's a fork patch. Expand the deep-dive below for the exact files to edit.
v0.5.0 bumped huggingface-hub from 0.34 to 1.0 (a major version jump). The `huggingface-cli` binary was renamed to `hf`. If you have shell scripts that call `huggingface-cli upload …`, they break on upgrade.
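The fix is mechanical — the subcommands carry over, only the binary name changes:

```bash
# Before (huggingface-hub < 1.0)
huggingface-cli upload my-user/my-dataset ./data

# After (huggingface-hub >= 1.0) — same subcommand, new binary name
hf upload my-user/my-dataset ./data
```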
SmolVLA fine-tune is the most plug-and-play path; GR00T second (more compute, better baseline). Skip GR00T-WBC and Holosoma — they're locomotion controllers gated to the Unitree G1. For G1 locomotion users, both ship as ONNX policies — no PyTorch fine-tune needed; pick one via `--robot.controller=…`.
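For example — the controller class names are documented in docs/source/unitree_g1.mdx, but `<g1-entry-point>` is a placeholder for whatever G1 launch script you actually run:

```bash
# Only the controller flag is the documented knob (unitree_g1.mdx:158,161)
<g1-entry-point> --robot.controller=GrootLocomotionController     # 50 Hz, dual-ONNX
<g1-entry-point> --robot.controller=HolosomaLocomotionController  # 200 Hz, single-ONNX
```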
pi_gemma is not a policy you can train
v0.5.0 added a file called `pi_gemma.py` that looks like a new policy. It isn't — it's a shared backbone module (PaliGemma transformer internals) factored out of π0 and π0.5. `--policy.type=pi_gemma` is not registered. Use `--policy.type=pi05` as before.
Show the full native vs bridge integration table (with hub IDs and file refs)
| Integration | Scope | Install path | File:line evidence |
|---|---|---|---|
| SmolVLA | VLA policy. Backbone: HF SmolVLM2-500M-Video-Instruct. Pretrained weights: `lerobot/smolvla_base`. | `pip install -e ".[smolvla]"` | policies/smolvla/configuration_smolvla.py:87 · policies/smolvla/smolvlm_with_expert.py:64 · docs/source/smolvla.mdx:49,58 |
| GR00T (manipulation VLA) | Wrapper around Isaac-GR00T. Default checkpoint `nvidia/GR00T-N1.5-3B`. Eagle 2.5-VL VLM in-tree. Already in 0.4.4 — the groot/ policy folders diff identically. | `pip install -e ".[groot]"` | policies/groot/configuration_groot.py:58,61,64 · policies/groot/modeling_groot.py:20-22 · pyproject.toml:149-159 |
| π0 / π0.5 / π0-FAST | SOTA-tier VLAs. PaliGemma 3B backbone, 224-px input. Pretrained `lerobot/pi05_base`, `lerobot/pi05_libero`. | `pip install -e ".[pi]"` | policies/pi05/processor_pi05.py:141 · policies/pi0/processor_pi0.py:135 · docs/source/pi05.mdx:78-80 |
| GR00T-WBC (locomotion) — NEW in 0.5.0 | Dual-ONNX lower-body locomotion stack for Unitree G1. Pulls `nepyope/GR00T-WholeBodyControl_g1` from HF. Runs at 50 Hz. | Loaded via `--robot.controller=GrootLocomotionController` | robots/unitree_g1/gr00t_locomotion.py:41,48,62-69 · docs/source/unitree_g1.mdx:158,254 |
| Holosoma — NEW in 0.5.0 | Single-ONNX whole-body locomotion controller, 200 Hz, 29 DoF. Origin: Amazon FAR (github.com/amazon-far/holosoma). FastSAC default; PPO variant available. | Loaded via `--robot.controller=HolosomaLocomotionController` | robots/unitree_g1/holosoma_locomotion.py:45,52,55-58 · docs/source/unitree_g1.mdx:161,255 |
| pi_gemma shared backbone — NEW in 0.5.0 | Shared module for π0 / π0.5 / π0-FAST. PaliGemma 3B at 224 px hardcoded as the tokenizer for all three. Not a registered policy — refactor target only. | Pulled in by `[pi]`, `[wallx]`, etc. | policies/pi_gemma.py:24-39,54,69 · policies/pi05/processor_pi05.py:141 · policies/pi0_fast/configuration_pi0_fast.py:56 |
| Wall-OSS / X-VLA / SARM | Other VLA variants (Wall-OSS = X-Square Robot, X-VLA = HF research, SARM = stage-aware reward model with CLIP). | `pip install -e ".[wallx|xvla|sarm]"` | pyproject.toml:140-161 · policies/sarm/processor_sarm.py:94-95 |
| Diffusers | Hard-required runtime dep. DDIM/DDPM in Diffusion Policy; attention/embedding primitives in GR00T action head; LR scheduler in optim. | Always installed | pyproject.toml:64 · policies/diffusion/modeling_diffusion.py:32-33 · policies/groot/action_head/cross_attention_dit.py:19-22 |
| HF Hub | Hard dep. Every checkpoint, dataset, env, and the new ONNX policy downloads (Holosoma, GR00T-WBC). Bumped from >=0.34.2 to >=1.0.0 — major-version jump. | Always installed | pyproject.toml:65 |
| Sim envs (gym-aloha, gym-pusht, metaworld, hf-libero) | External HF-org packages declared as optional extras. No SO-101 sim shipped — use EnvHub LeIsaac (community-published). | `pip install -e ".[aloha|pusht|metaworld|libero]"` | pyproject.toml:175-178 · docs/source/envhub_leisaac.mdx |
IsaacLab Arena is an in-tree config option (`--env.type=hub` with default `nvidia/isaaclab-arena-envs`) but it pulls envs from the HuggingFace hub, not from a vendored sim. Verify any community env you pull.
Show GR00T-WBC and Holosoma details (Unitree G1 only)
v0.5.0 ships two distinct GR00T-branded artifacts that share only the brand. The manipulation VLA (`policy.type=groot`) is unchanged from 0.4.4. The locomotion controllers (`--robot.controller=…`) are new in 0.5.0 and apply only to the G1 humanoid.
NVIDIA · dual-ONNX · 50 Hz
- Two ONNX files: Balance + Walk. `CONTROL_DT = 0.02` → 50 Hz (gr00t_locomotion.py:41).
- 15-DoF action.
- 6-step stacked observation history.
- 3-vector velocity command `[vx, vy, theta_dot]`.
- Default checkpoint: `nepyope/GR00T-WholeBodyControl_g1`.
- Origin: github.com/NVlabs/GR00T-WholeBodyControl.
Amazon FAR · single-ONNX · 200 Hz
- One ONNX file (default `fastsac_g1_29dof.onnx`; PPO variant `ppo_g1_29dof.onnx` available). `CONTROL_DT = 0.005` → 200 Hz.
- Drives 29 DoF directly.
- Phase variable in `[0, π]` with `GAIT_PERIOD = 0.5 s`.
- `kp`/`kd` joint gains pulled from ONNX metadata fields — gains travel with the policy.
- Default checkpoint: `nepyope/holosoma_locomotion`.
- Origin: github.com/amazon-far/holosoma [external claim].
Holosoma drops the upper-body arm joints from the observation vector before feeding the lower-body locomotion ONNX (robots/unitree_g1/holosoma_locomotion.py:161-163). The pattern (FastSAC + PPO with phase-conditioned gait, zero-arm-obs masking) is consistent with sim-to-real RL locomotion stacks; the source-code intent is “swap-in alternative to GR00T-WholeBodyControl.”
Both controllers are interchangeable at the config layer: `--robot.controller=HolosomaLocomotionController|GrootLocomotionController` (docs/source/unitree_g1.mdx:158,161,241). Since the action space targets a humanoid (15-DoF lower body or 29-DoF whole body), neither is portable to a single 6-DoF arm like SO-101 or UR20.
Show what's NOT in the code — V-JEPA, UMI, world models — with verification
V-JEPA / V-JEPA-2 / DINOv2 / DINOv3
A repo-wide grep for `vjepa`, `v_jepa`, `jepa`, `world_model`, `dinov2`, `DINOv` returns zero hits in `src/` for both 0.4.4 and 0.5.0.
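To reproduce the check from the repo root (a case-insensitive `jepa` also covers `vjepa` and `v_jepa`):

```bash
# Expect zero output on both v0.4.4 and v0.5.0
grep -rniE 'jepa|world_model|dinov' src/
```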
The default vision encoder for ACT and Diffusion Policy is ImageNet ResNet-18, and there is a hard ResNet-only validator at:
- ACT — `vision_backbone: str = "resnet18"` with a "must be one of the ResNet variants" check at policies/act/configuration_act.py:99-100,137
- Diffusion — same default and same validator at policies/diffusion/configuration_diffusion.py:122,171
The only non-ResNet SSL-style encoder anywhere in the repo is CLIP ViT-B/32 inside SARM — a stage-aware reward model, not a policy backbone: `CLIPModel.from_pretrained("openai/clip-vit-base-patch32")` at policies/sarm/processor_sarm.py:94-95.
To enable a V-JEPA-encoder ACT, you'd:
- Fork `policies/act/configuration_act.py` and remove (or extend) the "must be one of the ResNet variants" validator at policies/act/configuration_act.py:137. Add `vjepa`/`vjepa2` to the `vision_backbone` enum.
- Patch `policies/act/modeling_act.py`'s feature projector to accept V-JEPA's patch-token output (typically a larger embed dim than ResNet-18's 512). Re-thread `backbone_out_channels`.
- Add the V-JEPA repo as a vendored dep or pip-install the upstream package. The repo currently has zero V-JEPA imports — you are adding the first one.
- Decide on a freezing strategy: V-JEPA pre-trained weights are typically frozen for the first N steps, then unfrozen with a small LR. ACT's default training loop has no equivalent of GR00T's `tune_visual`/`tune_llm` toggles.
- Plan ~1–2 weekends for a clean implementation; another weekend to re-collect the comparison curve against the ResNet-18 baseline.
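In command form, the fork boils down to something like this (the branch name and upstream package spelling are placeholders; the two file paths are the ones cited in the steps above):

```bash
git checkout -b act-vjepa
$EDITOR policies/act/configuration_act.py  # relax the ResNet-only validator (:137)
$EDITOR policies/act/modeling_act.py       # re-thread backbone_out_channels
pip install <upstream-vjepa-package>       # first V-JEPA import in the repo
```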
UMI / Stanford handheld GoPro
Searches for `umi`, `universal_manipulation`, `umi-on-legs`, `gopro`, `fisheye`, `orb_slam` return no source-code hits.
The single string `"lerobot/umi_cup_in_the_wild"` at `__init__.py:101` is just a dataset slug listed alongside `aloha_*`, `unitreeh1_*`, etc. — i.e., the original Stanford UMI cup dataset has been converted to LeRobotDataset format and pushed to the HF hub, but there is no UMI-specific ingestion code: no SLAM trajectory parser, no GoPro fisheye undistortion, no handheld-gripper kinematics shim, and no umi-on-legs reference anywhere.
No world-model subsystem
No diffusion-world-model rollout, no Dreamer-style latent imagination, no generative world model used for planning. The only model-based-control path is the MPC inside tdmpc (TD-MPC2), which does use a learned latent-dynamics + reward model for short-horizon planning — but that is not what the “world model” literature usually means.
References to upstream papers / repos are tagged [external claim]; the file:line citation is to the comment / docstring / docs page that names them, not to a verified mirror of the external work.
- GR00T N1 paper (arXiv 2503.14734, NVIDIA, 2025) — policies/groot/README.md:3-7,12-21,27
- GR00T-WholeBodyControl (NVlabs) — docs/source/unitree_g1.mdx:254
- Holosoma (Amazon FAR) — docs/source/unitree_g1.mdx:255
- Wall-OSS (arXiv 2509.11766, X-Square Robot) — docs/source/walloss.mdx:3