Where your data lives
6 min read · for SO-101 users
Every lerobot-record session creates a folder with a parquet
table of motor states + actions and per-camera mp4s. The format on disk
(v3.0) hasn't changed between 0.4.4 and 0.5.0 — your existing datasets
work with both.
You don't need to re-record anything for the upgrade. If you push to the Hub, your old datasets remain valid. If you have ancient v2.1 datasets sitting around, there's a one-shot non-destructive converter.
A dataset is a folder of recorded demos with an index card.
Motor numbers and actions live in parquet (fast random access for training).
Camera frames live in mp4 (compact). A small meta/ folder is
the index card — it tells you how many episodes, which task labels,
and where each episode starts.
your-task/
├── meta/
│   ├── info.json                      # how many episodes, fps, schema
│   ├── tasks.parquet                  # task labels ("pick the red cube")
│   └── episodes/
│       └── chunk-000/file-000.parquet # per-episode stats + lengths
├── data/
│   └── chunk-000/
│       └── file-000.parquet           # motor states, actions
└── videos/
    └── observation.images.front/
        └── chunk-000/file-000.mp4
The basics
What gets recorded?
At every tick (e.g. 30 times per second), lerobot-record writes
one row: motor positions, the action that was sent to the robot, every camera
frame, a timestamp, an episode index, and the current task string. Your leader
arm provides the action; your follower arm provides the observation.
Why parquet + mp4 instead of one big file?
Training reads tiny random slices — "give me batch indices [42, 901, 17, 388]". Parquet is columnar, so pulling four rows is fast. Video frames are huge if stored as raw arrays, so they go into mp4 and get decoded on demand. The trick: at read time, LeRobot looks up which mp4 + which timestamp range covers the frame you asked for.
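That lookup is easy to sketch in plain Python. The episode metadata below is invented for illustration (the real fields live in meta/episodes/*.parquet, and the real lookup is inside LeRobotDataset):

```python
# Sketch: map a global frame index -> (mp4 file, timestamp inside that mp4).
# Episode rows are hypothetical; field names mirror the v3.0 metadata.
episodes = [
    {"file": "videos/observation.images.front/chunk-000/file-000.mp4",
     "dataset_from_index": 0, "dataset_to_index": 300, "from_timestamp": 0.0},
    {"file": "videos/observation.images.front/chunk-000/file-000.mp4",
     "dataset_from_index": 300, "dataset_to_index": 540, "from_timestamp": 10.0},
]
FPS = 30

def locate(frame_idx):
    """Find which mp4 holds this frame and at what timestamp to decode it."""
    for ep in episodes:
        if ep["dataset_from_index"] <= frame_idx < ep["dataset_to_index"]:
            local = frame_idx - ep["dataset_from_index"]
            return ep["file"], ep["from_timestamp"] + local / FPS
    raise IndexError(frame_idx)

print(locate(310))  # frame 310 is local frame 10 of episode 1
```

Note how `from_timestamp` makes a concatenated multi-episode mp4 look like a per-episode file to the reader.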
What's an episode?
One demo from start to end. You press a key to start recording, do the task with your leader arm, press a key to end, and that's one episode. Most SO-101 training runs need 30–100 episodes per task.
Format versions: v2.1 vs v3.0
The format on disk has its own version number, separate from the LeRobot pip
version. Both v0.4.4 and v0.5.0 ship v3.0. The previous format
(v2.1) stored one parquet and one mp4 per episode; v3.0 packs many
episodes into size-bounded files (default 100 MB for parquet, 200 MB for
video). A 1000-episode dataset that had 1000 mp4 files in v2.1 has tens in
v3.0. If you have v2.1 data, the converter handles it (see workflows below).
Hub upload
dataset.push_to_hub() uploads your dataset as a HuggingFace
dataset repo. The temp images/ staging directory (where PNGs
land before mp4 encoding) is excluded automatically; videos are included
unless you opt out. On completion, the dataset repo is re-tagged v3.0.
How you actually use it
- Record 50 demos with a leader arm: lerobot-record (the recording-loop section below walks through what each tick writes).
- Convert an old v2.1 dataset to v3.0 (non-destructive — keeps the original): the converter at datasets/v30/convert_dataset_v21_to_v30.py.
- Inspect or visualize a dataset: instantiate LeRobotDataset in Python and sample a few frames.
Things to know
MultiLeRobotDataset is disabled
For mixing real and sim datasets (or any multi-source training), the old
MultiLeRobotDataset path raises
NotImplementedError. The supported pattern is: call
aggregate_datasets() to merge sources into a single dataset
first, then if you want non-uniform weighting (e.g. 70/30 real/sim), wrap
a custom Sampler over EpisodeAwareSampler.indices.
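A toy sketch of that weighting in plain Python (the index lists are invented, and a real implementation would subclass torch.utils.data.Sampler rather than return a list):

```python
import random

# Hypothetical frame indices for each source after aggregate_datasets()
# has merged real and sim demos into one dataset.
real_indices = list(range(0, 3000))     # frames recorded on the real robot
sim_indices = list(range(3000, 10000))  # frames recorded in sim

def weighted_epoch(n_samples, p_real=0.7, seed=0):
    """Draw n_samples frame indices, roughly 70% real / 30% sim."""
    rng = random.Random(seed)
    return [rng.choice(real_indices) if rng.random() < p_real
            else rng.choice(sim_indices)
            for _ in range(n_samples)]

batch = weighted_epoch(10_000)
print(sum(i < 3000 for i in batch) / len(batch))  # close to 0.7
```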
episodes_stats.jsonl is gone
In v3.0, per-episode stats live as flatten-dict columns
(stats/<feature>/<key>) inside the episodes parquet,
not in a separate episodes_stats.jsonl. The conversion script
handles this for you. If a downstream tool expects the old jsonl, point it
at meta/stats.json for aggregate stats.
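The flattened column naming is simple to illustrate. This helper and the stats values are made up; only the stats/&lt;feature&gt;/&lt;key&gt; pattern comes from the format:

```python
def flatten_stats(stats, prefix="stats"):
    """Turn nested per-episode stats into flat parquet-style column names."""
    cols = {}
    for feature, keys in stats.items():
        for key, value in keys.items():
            cols[f"{prefix}/{feature}/{key}"] = value
    return cols

per_episode = {"action": {"mean": [0.1], "std": [0.9]},
               "observation.state": {"min": [-1.0], "max": [1.0]}}
print(flatten_stats(per_episode))
# {'stats/action/mean': [0.1], 'stats/action/std': [0.9], ...}
```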
The converter swap is non-destructive
The converter builds <root>_v30, swaps it into place via
shutil.move, and keeps the original as
<root>_old. Verify the new dataset loads, then delete
<root>_old once you're confident.
0.4.4 and 0.5.0 share the on-disk format
You can pin either pip version against the same on-disk dataset. There were
no info.json schema changes, no API breaks, and no parquet
column changes between the two versions — just one cosmetic
pd.Index name and some Python 3.12 syntax cleanup.
lerobot-edit-dataset for trim / merge / relabel
You don't have to re-record to fix a dataset. Surgical operations are
exposed in dataset_tools.py: delete_episodes,
split_dataset, merge_datasets,
modify_features, add_features,
remove_feature, modify_tasks. Each emits a copy at
output_dir — your original is never mutated.
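The copy-then-re-index pattern behind those operations is easy to mimic. Here is a toy sketch with an invented episode table, not the real dataset_tools.py code:

```python
def delete_episodes(episodes, to_drop):
    """Return a re-indexed copy; the input table is never mutated."""
    kept = [ep for ep in episodes if ep["episode_index"] not in to_drop]
    out, cursor = [], 0
    for new_idx, ep in enumerate(kept):
        length = ep["length"]
        out.append({**ep, "episode_index": new_idx,
                    "dataset_from_index": cursor,          # boundaries re-packed
                    "dataset_to_index": cursor + length})
        cursor += length
    return out

eps = [{"episode_index": i, "length": 100,
        "dataset_from_index": i * 100, "dataset_to_index": (i + 1) * 100}
       for i in range(5)]
print(delete_episodes(eps, {1, 3}))  # 3 episodes left, re-numbered 0..2
```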
LeRobotDataset class internals (file:line refs)
Two-class split inside src/lerobot/datasets/lerobot_dataset.py:
- LeRobotDatasetMetadata (lerobot_dataset.py:86-545) holds info, tasks, subtasks, episodes, stats and the parquet writer; its .create() classmethod is at lerobot_dataset.py:500.
- LeRobotDataset(torch.utils.data.Dataset) wraps it for training and starts at lerobot_dataset.py:566.
LeRobotDataset.__init__
(lerobot_dataset.py:567) takes:
repo_id, root, episodes, image_transforms, delta_timestamps,
tolerance_s, revision, force_cache_sync, download_videos,
video_backend, batch_encoding_size, vcodec="libsvtav1",
streaming_encoding=False, encoder_queue_maxsize=30, encoder_threads=None
Construction sequence
(lerobot_dataset.py:707-781): build
metadata → check version compatibility → try local cache, else
snapshot_download → build
_absolute_to_relative_idx if episodes is a subset
→ resolve delta_indices from delta_timestamps.
__getitem__
(lerobot_dataset.py:1082-1118): lazy-load
hf_dataset → pull row → if
delta_indices is set, _get_query_indices
(:972-999) returns clipped indices plus a
{key}_is_pad mask → _query_hf_dataset batches
non-video keys (:1020-1046) →
_query_videos (:1048-1067) shifts
timestamps by
ep["videos/{vid_key}/from_timestamp"] before decoding (this is
why concatenated mp4s look like per-episode mp4s to the reader) →
image transforms per camera → task string materialized from
task_index.
Episode boundaries.
dataset_from_index / dataset_to_index are stored in
the per-episode metadata parquet
(lerobot_dataset.py:352-354, 389-390) and
consumed by EpisodeAwareSampler
(sampler.py:42-47).
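A sketch of how a sampler can consume those boundaries. The drop_n_last_frames behavior is modeled on what an episode-aware sampler needs (skipping frames too close to the episode end to fit a full action chunk); the numbers are invented:

```python
def episode_aware_indices(boundaries, drop_n_last_frames=0):
    """Flatten (dataset_from_index, dataset_to_index) pairs into trainable
    frame indices, skipping the last n frames of every episode."""
    indices = []
    for start, stop in boundaries:
        indices.extend(range(start, stop - drop_n_last_frames))
    return indices

bounds = [(0, 5), (5, 12)]  # two episodes: frames 0-4 and 5-11
print(episode_aware_indices(bounds, drop_n_last_frames=2))
# [0, 1, 2, 5, 6, 7, 8, 9]
```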
delta_timestamps is your action-chunking knob — a
dict[str, list[float]] like
{"action": [0, 0.02, 0.04, …]}. The training factory
datasets/factory.py:38-68 builds it from the
policy's action_delta_indices /
observation_delta_indices / reward_delta_indices
divided by fps.
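The arithmetic is just index-over-fps. A sketch with invented delta indices (the attribute names mirror the ones the factory reads):

```python
fps = 30  # from meta/info.json

# Hypothetical policy attributes:
action_delta_indices = list(range(10))  # next 10 actions (action chunking)
observation_delta_indices = [-1, 0]     # previous and current observation

delta_timestamps = {
    "action": [i / fps for i in action_delta_indices],
    "observation.state": [i / fps for i in observation_delta_indices],
}
print(delta_timestamps["action"][:3])  # offsets of 0, 1/30 and 2/30 seconds
```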
Memory layout. hf_dataset is a
pyarrow-memmapped datasets.Dataset over all
data/chunk-*/file-*.parquet — no RAM cost beyond the mmap.
Tasks: pd.DataFrame indexed by task string. Episodes: a
datasets.Dataset from
meta/episodes/chunk-*/file-*.parquet with stats/*
columns dropped on load.
Default features (auto-injected) at
utils.py:73-79: timestamp,
frame_index, episode_index, index,
task_index.
The v2.1 → v3.0 schema migration in detail
Canonical schema-change spec lives in the header of datasets/v30/convert_dataset_v21_to_v30.py:92-133.
v2.1 layout
- Data parquet: data/chunk-000/episode_000000.parquet — one file per episode
- Video mp4: videos/chunk-000/CAMERA/episode_000000.mp4 — one mp4 per episode per camera
- Episode metadata: episodes.jsonl — row: {episode_index, tasks, length}
- Tasks: tasks.jsonl — row: {task_index, task}
- Per-episode stats: episodes_stats.jsonl — separate file
- Aggregate stats: n/a
- info.json: has total_chunks, total_videos

v3.0 layout
- Data parquet: data/chunk-000/file-000.parquet — multi-episode, size-bounded
- Video mp4: videos/CAMERA/chunk-000/file-000.mp4 — multi-episode, concatenated
- Episode metadata: meta/episodes/chunk-000/file-000.parquet — row: episode_index, tasks, length, dataset_from_index, dataset_to_index, data/{chunk_index, file_index}, videos/<cam>/{chunk_index, file_index, from_timestamp, to_timestamp}, stats/<feature>/<key> flatten-dict
- Tasks: meta/tasks.parquet — index=task string, col=task_index
- Per-episode stats: inlined into the episodes parquet under stats/<feature>/<key> columns; episodes_stats.jsonl is gone
- Aggregate stats: meta/stats.json — min, max, mean, std, count, q01, q10, q50, q90, q99
- info.json: total_chunks and total_videos deleted; gains data_files_size_in_mb, video_files_size_in_mb, per-feature fps
What conversion does —
convert_dataset(...)
(datasets/v30/convert_dataset_v21_to_v30.py:451)
runs:
- validate_local_dataset_version requires v2.1 (:161-169).
- convert_info bumps the version, deletes total_chunks/total_videos, and propagates fps into non-video features (:432-448).
- convert_tasks: jsonl → parquet (:172-178).
- convert_data: concats with the size cap, tracks dataset_from_index/dataset_to_index (:202-250).
- convert_videos per camera via concatenate_video_files, tracks from_timestamp/to_timestamp (:303-371).
- convert_episodes_metadata folds legacy metadata, chunk indices, and stats into a single episodes parquet (:408-429).
- Hub side deletes legacy patterns (data/chunk*/episode_*, meta/*.jsonl, videos/chunk*) and retags v3.0 (:510-525).
Conversion appends episodes to growing parquet/mp4 until size hits
data_files_size_in_mb (100 MB) or
video_files_size_in_mb (200 MB), then opens a new file via
update_chunk_file_indices
(utils.py:97-103).
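The rollover logic can be sketched like this (invented sizes and a simplified single-chunk placement; it mirrors the described behavior, not the actual utils.py code):

```python
DATA_CAP_MB = 100  # data_files_size_in_mb default

def pack_episodes(episode_sizes_mb, cap_mb=DATA_CAP_MB):
    """Assign each episode a (chunk_index, file_index), opening a new file
    whenever the current one would exceed the size cap."""
    placements, file_idx, used = [], 0, 0
    for size in episode_sizes_mb:
        if used and used + size > cap_mb:
            file_idx += 1  # analogue of update_chunk_file_indices
            used = 0
        placements.append((0, file_idx))  # chunk 0 until a files-per-chunk cap
        used += size
    return placements

# 1000 small episodes land in tens of files instead of 1000:
print(len(set(pack_episodes([1] * 1000))))  # 10
```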
Migration playbook
- Pick chunking knobs. Defaults (100 / 200 MB) are fine for sub-minute episodes at 30 fps. For 5-min episodes at 60 fps with 3 RealSense streams, bump video to ~500 MB.
- Run the conversion script (non-destructive).
- Verify version. Open meta/info.json and confirm codebase_version == "v3.0"; confirm meta/episodes_stats.jsonl is gone.
- Smoke-test loading. Instantiate LeRobotDataset in Python and print ds[0].keys().
- Once stable, delete <root>_old. Push to the Hub.
The recording loop and what gets written to disk
Entry point: record(cfg: RecordConfig) at
scripts/lerobot_record.py:434. Flow
(scripts/lerobot_record.py:435-602):
- make_robot_from_config + optional make_teleoperator_from_config.
- Build dataset_features via aggregate_pipeline_dataset_features (:451-464) — camera streams enter the schema here through robot.observation_features, keyed observation.images.<cam>.
- LeRobotDataset.create(...) with encoder knobs (:490-504), or LeRobotDataset(...) if resuming.
- record_loop(...) (:280-432) runs while timestamp < control_time_s — each iteration: obs = robot.get_observation(), run policy or teleop, robot.send_action(...), build a frame via build_dataset_frame(dataset.features, …), then dataset.add_frame(frame) (:414).
- After the loop, dataset.save_episode() (:581), wrapped in VideoEncodingManager(dataset) (:532), which flushes pending batch encodes on exit.
Camera frames go through _save_image to
images/{key}/episode-XXXXXX/frame-XXXXXX.png; the path is
stashed in the buffer; PNGs are removed after mp4 encoding
(shutil.rmtree(img_dir) in _encode_video_worker
at lerobot_dataset.py:562). Task string is
per-frame (lerobot_dataset.py:1192) —
multi-task within an episode works.
Performance notes
- Video decode: default torchcodec if installed, else pyav (datasets/video_utils.py:117-124).
- Video encode: vcodec="libsvtav1" by default (lerobot_dataset.py:580, 696-698); streaming_encoding=True skips the PNG round-trip and encodes during add_frame (lerobot_dataset.py:1208-1210) — the docstring says it makes save_episode() near-instant.
- Parallel encode: save_episode runs ProcessPoolExecutor(max_workers=num_cameras) when parallel_encoding=True and num_cameras > 1 (lerobot_dataset.py:1305-1336). Default true on Linux, false on macOS.
- Worker gotcha: building a DataLoader(num_workers=0) after a DataLoader(num_workers>0) in the same process can segfault from cross-process video-decoder refs (lerobot_dataset.py:1049-1052) — pick one.