What plugs in
For SO-101 users
What this means for you: if you want to fine-tune SmolVLA or GR00T on your SO-101 demos, that just works — one pip extra to install, one config flag, go. If you wanted to plug a V-JEPA visual encoder into ACT instead of its default ResNet, you'd be writing a fork.
SHIPPED IN v0.5.0:

┌──────────────┐  ┌─────────────┐  ┌─────────────┐
│   SmolVLA    │  │  GR00T VLA  │  │  π0 / π0.5  │
│  (450M, HF)  │  │  (NVIDIA)   │  │ (PaliGemma) │
└──────────────┘  └─────────────┘  └─────────────┘

┌──────────────┐  ┌─────────────┐
│  GR00T-WBC   │  │  Holosoma   │   (G1 humanoid only)
│ (locomotion) │  │ (locomotion)│
└──────────────┘  └─────────────┘

NOT IN THE CODE — fork-only:

▢ V-JEPA / DINO encoders
▢ UMI ingestion (Stanford handheld)
▢ Diffusion world models for rollout / planning
Top row is what an SO-101 user can actually use. Middle row is humanoid-only. The fork-only list at the bottom is the work you'd own if you wanted it.
The basics
What's a VLA?
A VLA is a Vision-Language-Action model: a big transformer that takes camera frames plus a text instruction (“put the red block in the bowl”) and outputs robot actions. SmolVLA, GR00T, and π0 are all VLAs — same family, different sizes and pre-training recipes.
What ships natively?
“Native” here means the code is in the repo, a pip install extra brings it in, and a config flag turns it on.
- SmolVLA (450M parameters, HuggingFace) — the friendliest VLA for fine-tuning on a single consumer GPU. Closest thing to a “weekend VLA” on SO-101. Pretrained checkpoint `lerobot/smolvla_base`.
- GR00T N1.5 (3B, NVIDIA) — manipulation only, more compute, stronger zero-shot. Pretrained checkpoint `nvidia/GR00T-N1.5-3B`. Wrapper delegates to NVIDIA's Isaac-GR00T runner.
- π0 / π0.5 / π0-FAST — the SOTA tier, large models, PaliGemma backbone. Pretrained `lerobot/pi05_base`.
- GR00T-WBC + Holosoma — locomotion ONNX policies for the Unitree G1 humanoid. Not relevant for arm-only setups.
What's missing?
V-JEPA, V-JEPA-2, DINOv2/DINOv3 (research-grade visual encoders), UMI (Stanford's handheld GoPro data-collection pipeline), and diffusion world models (learned-dynamics planners) are all absent from the codebase. A repo-wide grep for any of these terms returns zero hits in `src/`.
Sims
Four optional sim extras: gym-aloha (bimanual ALOHA), gym-pusht (Push-T benchmark), metaworld (manipulation suite), hf-libero (LIBERO, Linux only). None simulate SO-101. For SO-101 in sim you pull a community env from the HuggingFace hub — not an official one.
How you actually use it
Fine-tune SmolVLA on your demos — the most plug-and-play VLA path:
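A minimal sketch of that path. The `lerobot-train` entry point and `--policy.path` flag follow the repo's SmolVLA docs; the dataset id, batch size, and step count below are placeholders — check docs/source/smolvla.mdx for the exact invocation:

```bash
# One pip extra…
pip install -e ".[smolvla]"

# …one command to fine-tune the pretrained checkpoint on your demos
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=<hf_user>/<so101_demos> \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_so101
```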
Fine-tune GR00T N1.5 — bigger model, better zero-shot, more compute:
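Roughly the same shape (a sketch: `policy.type=groot` and `embodiment_tag=new_embodiment` are the documented knobs here; the remaining flag spellings are assumptions to verify against the GR00T docs page):

```bash
pip install -e ".[groot]"

lerobot-train \
  --policy.type=groot \
  --policy.embodiment_tag=new_embodiment \
  --dataset.repo_id=<hf_user>/<so101_demos> \
  --steps=20000 \
  --output_dir=outputs/train/groot_so101
```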
`embodiment_tag=new_embodiment` tells GR00T “this is a new arm shape, learn to map onto it.”
Things to know
Both ACT and Diffusion Policy hard-validate the vision backbone as one of the ResNet variants. Swapping in V-JEPA, DINO, or any other encoder is not a config flag — it's a fork patch. Expand the deep-dive below for the exact files to edit.
v0.5.0 bumped huggingface-hub from 0.34 to 1.0 (a major version jump). The `huggingface-cli` binary was renamed to `hf`. If you have shell scripts that call `huggingface-cli upload …`, they break on upgrade.
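The fix is mechanical — the subcommands carry over, only the binary name changes:

```bash
# Before (huggingface-hub < 1.0)
huggingface-cli upload my-user/my-dataset ./data

# After (huggingface-hub >= 1.0) — same subcommand, new binary name
hf upload my-user/my-dataset ./data
```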
SmolVLA fine-tune is the most plug-and-play path; GR00T second (more compute, better baseline). Skip GR00T-WBC and Holosoma — they're locomotion controllers gated to the Unitree G1. For G1 locomotion users, both ship as ONNX policies — no PyTorch fine-tune needed; pick one via `--robot.controller=…`.
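For example — the controller class names are documented in docs/source/unitree_g1.mdx, but `<g1-entry-point>` is a placeholder for whatever G1 launch script you actually run:

```bash
# Only the controller flag is the documented knob (unitree_g1.mdx:158,161)
<g1-entry-point> --robot.controller=GrootLocomotionController     # 50 Hz, dual-ONNX
<g1-entry-point> --robot.controller=HolosomaLocomotionController  # 200 Hz, single-ONNX
```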
pi_gemma is not a policy you can train
v0.5.0 added a file called `pi_gemma.py` that looks like a new policy. It isn't — it's a shared backbone module (PaliGemma transformer internals) factored out of π0 and π0.5. `--policy.type=pi_gemma` is not registered. Use `--policy.type=pi05` as before.
Show the full native vs bridge integration table (with hub IDs and file refs)
| Integration | Scope | Install path | File:line evidence |
|---|---|---|---|
| SmolVLA | VLA policy. Backbone: HF SmolVLM2-500M-Video-Instruct. Pretrained weights: `lerobot/smolvla_base`. | `pip install -e ".[smolvla]"` | policies/smolvla/configuration_smolvla.py:87 · policies/smolvla/smolvlm_with_expert.py:64 · docs/source/smolvla.mdx:49,58 |
| GR00T (manipulation VLA) | Wrapper around Isaac-GR00T. Default checkpoint `nvidia/GR00T-N1.5-3B`. Eagle 2.5-VL VLM in-tree. Already in 0.4.4 — the groot/ policy folders diff identically. | `pip install -e ".[groot]"` | policies/groot/configuration_groot.py:58,61,64 · policies/groot/modeling_groot.py:20-22 · pyproject.toml:149-159 |
| π0 / π0.5 / π0-FAST | SOTA-tier VLAs. PaliGemma 3B backbone, 224-px input. Pretrained `lerobot/pi05_base`, `lerobot/pi05_libero`. | `pip install -e ".[pi]"` | policies/pi05/processor_pi05.py:141 · policies/pi0/processor_pi0.py:135 · docs/source/pi05.mdx:78-80 |
| GR00T-WBC (locomotion) — NEW in 0.5.0 | Dual-ONNX lower-body locomotion stack for Unitree G1. Pulls `nepyope/GR00T-WholeBodyControl_g1` from HF. Runs at 50 Hz. | Loaded via `--robot.controller=GrootLocomotionController` | robots/unitree_g1/gr00t_locomotion.py:41,48,62-69 · docs/source/unitree_g1.mdx:158,254 |
| Holosoma — NEW in 0.5.0 | Single-ONNX whole-body locomotion controller, 200 Hz, 29 DoF. Origin: Amazon FAR (github.com/amazon-far/holosoma). FastSAC default; PPO variant available. | Loaded via `--robot.controller=HolosomaLocomotionController` | robots/unitree_g1/holosoma_locomotion.py:45,52,55-58 · docs/source/unitree_g1.mdx:161,255 |
| pi_gemma shared backbone — NEW in 0.5.0 | Shared module for π0 / π0.5 / π0-FAST. PaliGemma 3B at 224 px hardcoded as the tokenizer for all three. Not a registered policy — refactor target only. | Pulled in by `[pi]`, `[wallx]`, etc. | policies/pi_gemma.py:24-39,54,69 · policies/pi05/processor_pi05.py:141 · policies/pi0_fast/configuration_pi0_fast.py:56 |
| Wall-OSS / X-VLA / SARM | Other VLA variants (Wall-OSS = X-Square Robot, X-VLA = HF research, SARM = stage-aware reward model with CLIP). | `pip install -e ".[wallx|xvla|sarm]"` | pyproject.toml:140-161 · policies/sarm/processor_sarm.py:94-95 |
| Diffusers | Hard-required runtime dep. DDIM/DDPM in Diffusion Policy; attention/embedding primitives in GR00T action head; LR scheduler in optim. | Always installed | pyproject.toml:64 · policies/diffusion/modeling_diffusion.py:32-33 · policies/groot/action_head/cross_attention_dit.py:19-22 |
| HF Hub | Hard dep. Every checkpoint, dataset, env, and the new ONNX policy downloads (Holosoma, GR00T-WBC). Bumped from >=0.34.2 to >=1.0.0 — major-version jump. | Always installed | pyproject.toml:65 |
| Sim envs (gym-aloha, gym-pusht, metaworld, hf-libero) | External HF-org packages declared as optional extras. No SO-101 sim shipped — use EnvHub LeIsaac (community-published). | `pip install -e ".[aloha|pusht|metaworld|libero]"` | pyproject.toml:175-178 · docs/source/envhub_leisaac.mdx |
IsaacLab Arena is an in-tree config option (`--env.type=hub` with default `nvidia/isaaclab-arena-envs`) but it pulls envs from the HuggingFace hub, not from a vendored sim. Verify any community env you pull.
Show GR00T-WBC and Holosoma details (Unitree G1 only)
v0.5.0 ships two distinct GR00T-branded artifacts that share only the brand. The manipulation VLA (`policy.type=groot`) is unchanged from 0.4.4. The locomotion controllers (`--robot.controller=…`) are new in 0.5.0 and apply only to the G1 humanoid.
NVIDIA · dual-ONNX · 50 Hz
- Two ONNX files: Balance + Walk. `CONTROL_DT = 0.02` → 50 Hz (gr00t_locomotion.py:41).
- 15-DoF action.
- 6-step stacked observation history.
- 3-vector velocity command `[vx, vy, theta_dot]`.
- Default checkpoint: `nepyope/GR00T-WholeBodyControl_g1`.
- Origin: github.com/NVlabs/GR00T-WholeBodyControl.
Amazon FAR · single-ONNX · 200 Hz
- One ONNX file (default `fastsac_g1_29dof.onnx`; PPO variant `ppo_g1_29dof.onnx` available). `CONTROL_DT = 0.005` → 200 Hz.
- Drives 29 DoF directly.
- Phase variable in `[0, π]` with `GAIT_PERIOD = 0.5 s`.
- `kp`/`kd` joint gains pulled from ONNX metadata fields — gains travel with the policy.
- Default checkpoint: `nepyope/holosoma_locomotion`.
- Origin: github.com/amazon-far/holosoma [external claim].
Holosoma drops the upper-body arm joints from the observation vector before feeding the lower-body locomotion ONNX (robots/unitree_g1/holosoma_locomotion.py:161-163). The pattern (FastSAC + PPO with phase-conditioned gait, zero-arm-obs masking) is consistent with sim-to-real RL locomotion stacks; the source-code intent is “swap-in alternative to GR00T-WholeBodyControl.”
Both controllers are interchangeable at the config layer: `--robot.controller=HolosomaLocomotionController|GrootLocomotionController` (docs/source/unitree_g1.mdx:158,161,241). Since the action space targets a humanoid (15-DoF lower body or 29-DoF whole body), neither is portable to a single 6-DoF arm like SO-101 or UR20.
Show what's NOT in the code — V-JEPA, UMI, world models — with verification
V-JEPA / V-JEPA-2 / DINOv2 / DINOv3
A repo-wide grep for `vjepa`, `v_jepa`, `jepa`, `world_model`, `dinov2`, `DINOv` returns zero hits in `src/` for both 0.4.4 and 0.5.0.
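To reproduce the check from the repo root (a case-insensitive `jepa` also covers `vjepa` and `v_jepa`):

```bash
# Expect zero output on both v0.4.4 and v0.5.0
grep -rniE 'jepa|world_model|dinov' src/
```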
The default vision encoder for ACT and Diffusion Policy is ImageNet ResNet-18, and there is a hard ResNet-only validator at:
- ACT — `vision_backbone: str = "resnet18"` with a "must be one of the ResNet variants" check at policies/act/configuration_act.py:99-100,137
- Diffusion — same default and same validator at policies/diffusion/configuration_diffusion.py:122,171
The only non-ResNet SSL-style encoder anywhere in the repo is CLIP ViT-B/32 inside SARM — a stage-aware reward model, not a policy backbone: `CLIPModel.from_pretrained("openai/clip-vit-base-patch32")` at policies/sarm/processor_sarm.py:94-95.
To enable a V-JEPA-encoder ACT, you'd:
- Fork `policies/act/configuration_act.py` and remove (or extend) the "must be one of the ResNet variants" validator at policies/act/configuration_act.py:137. Add `vjepa`/`vjepa2` to the `vision_backbone` enum.
- Patch `policies/act/modeling_act.py`'s feature projector to accept V-JEPA's patch-token output (typically a larger embed dim than ResNet-18's 512). Re-thread `backbone_out_channels`.
- Add the V-JEPA repo as a vendored dep or pip-install the upstream package. The repo currently has zero V-JEPA imports — you are adding the first one.
- Decide on a freezing strategy: V-JEPA pre-trained weights are typically frozen for the first N steps, then unfrozen with a small LR. ACT's default training loop has no equivalent of GR00T's `tune_visual`/`tune_llm` toggles.
- Plan ~1–2 weekends for a clean implementation; another weekend to re-collect the comparison curve against the ResNet-18 baseline.
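In command form, the fork boils down to something like this (the branch name and upstream package spelling are placeholders; the two file paths are the ones cited in the steps above):

```bash
git checkout -b act-vjepa
$EDITOR policies/act/configuration_act.py  # relax the ResNet-only validator (:137)
$EDITOR policies/act/modeling_act.py       # re-thread backbone_out_channels
pip install <upstream-vjepa-package>       # first V-JEPA import in the repo
```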
UMI / Stanford handheld GoPro
Searches for `umi`, `universal_manipulation`, `umi-on-legs`, `gopro`, `fisheye`, `orb_slam` return no source-code hits.
The single string `"lerobot/umi_cup_in_the_wild"` at `__init__.py:101` is just a dataset slug listed alongside `aloha_*`, `unitreeh1_*`, etc. — i.e., the original Stanford UMI cup dataset has been converted to LeRobotDataset format and pushed to the HF hub, but there is no UMI-specific ingestion code: no SLAM trajectory parser, no GoPro fisheye undistortion, no handheld-gripper kinematics shim, and no umi-on-legs reference anywhere.
No world-model subsystem
No diffusion-world-model rollout, no Dreamer-style latent imagination, no generative world model used for planning. The only model-based-control path is the MPC inside tdmpc (TD-MPC2), which does use a learned latent-dynamics + reward model for short-horizon planning — but that is not what the “world model” literature usually means.
References to upstream papers / repos are tagged [external claim]; the file:line citation is to the comment / docstring / docs page that names them, not to a verified mirror of the external work.
- GR00T N1 paper (arXiv 2503.14734, NVIDIA, 2025) — policies/groot/README.md:3-7,12-21,27
- GR00T-WholeBodyControl (NVlabs) — docs/source/unitree_g1.mdx:254
- Holosoma (Amazon FAR) — docs/source/unitree_g1.mdx:255
- Wall-OSS (arXiv 2509.11766, X-Square Robot) — docs/source/walloss.mdx:3