DATA

Where your data lives

6 min read · for SO-101 users

Every lerobot-record session creates a folder with a parquet table of motor states + actions and per-camera mp4s. The format on disk (v3.0) hasn't changed between 0.4.4 and 0.5.0 — your existing datasets work with both.

You don't need to re-record anything for the upgrade. If you push to the Hub, your old datasets remain valid. If you have ancient v2.1 datasets sitting around, there's a one-shot non-destructive converter.

A dataset is a folder of recorded demos with an index card. Motor readings and actions live in parquet (fast random access for training). Camera frames live in mp4 (compact). A small meta/ folder is the index card — it tells you how many episodes there are, which task labels they carry, and where each episode starts.

your-task/
├── meta/
│   ├── info.json          # how many episodes, fps, schema
│   ├── tasks.parquet      # task labels ("pick the red cube")
│   └── episodes/
│       └── chunk-000/file-000.parquet   # per-episode stats + lengths
├── data/
│   └── chunk-000/
│       └── file-000.parquet   # motor states, actions
└── videos/
    └── observation.images.front/
        └── chunk-000/file-000.mp4

The basics

What gets recorded?

At every tick (e.g. 30 times per second), lerobot-record writes one row: motor positions, the action that was sent to the robot, every camera frame, a timestamp, an episode index, and the current task string. Your leader arm provides the action; your follower arm provides the observation.
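
To see one of those rows, index the dataset directly. A minimal sketch, assuming the dataset is available locally or on the Hub and uses the standard SO-101 key names:

# Minimal sketch: inspect one recorded row.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("youruser/pick-cube")
frame = ds[0]
print(sorted(frame.keys()))
# expect keys like: action, episode_index, frame_index, index,
# observation.images.front, observation.state, task, task_index, timestamp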

Why parquet + mp4 instead of one big file?

Training reads tiny random slices — "give me batch indices [42, 901, 17, 388]". Parquet is columnar, so pulling four rows is fast. Video frames are huge if stored as raw arrays, so they go into mp4 and get decoded on demand. The trick: at read time, LeRobot looks up which mp4 + which timestamp range covers the frame you asked for.
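
In code, that access pattern looks like this (a sketch; the observation.state key assumes the standard SO-101 schema):

import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("youruser/pick-cube")
# Four arbitrary rows; only the frames covering those timestamps get decoded.
batch = [ds[i] for i in (42, 901, 17, 388)]
states = torch.stack([item["observation.state"] for item in batch])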

What's an episode?

One demo from start to end. You press a key to start recording, do the task with your leader arm, press a key to end, and that's one episode. Most SO-101 training runs need 30–100 episodes per task.

Format versions: v2.1 vs v3.0

The format on disk has its own version number, separate from the LeRobot pip version. Both v0.4.4 and v0.5.0 ship v3.0. The previous format (v2.1) stored one parquet and one mp4 per episode; v3.0 packs many episodes into size-bounded files (default 100 MB for parquet, 200 MB for video). A 1000-episode dataset that had 1000 mp4 files per camera in v2.1 ends up with tens of files in v3.0. If you have v2.1 data, the converter handles it (see workflows below).

Hub upload

dataset.push_to_hub() uploads your dataset as a Hugging Face dataset repo. The temp images/ staging directory (where PNGs land before mp4 encoding) is excluded automatically; videos are included unless you opt out. On completion, the repo is tagged v3.0.
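
The call itself is one line. A sketch, assuming you are already authenticated (e.g. via huggingface-cli login):

from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("youruser/pick-cube")
ds.push_to_hub()  # temp images/ excluded; repo tagged v3.0 on completion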

How you actually use it

Record 50 demos with a leader arm:

# Teleoperate the follower with a leader arm, record everything
lerobot-record \
  --robot.type=so101_follower \
  --teleop.type=so101_leader \
  --dataset.repo_id=youruser/pick-cube \
  --dataset.num_episodes=50

Convert an old v2.1 dataset to v3.0 (non-destructive — keeps the original):

# Builds <root>_v30, swaps it in, archives the original as <root>_old
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 \
  --repo-id=youruser/old-task \
  --root=/path/to/old_dataset \
  --push-to-hub=false

Inspect or visualize a dataset:

# Browser UI: scrub through episodes, see motor traces + camera frames
lerobot-dataset-viz \
  --dataset.repo_id=youruser/pick-cube

Things to know

MultiLeRobotDataset is disabled

For mixing real and sim datasets (or any multi-source training), the old MultiLeRobotDataset path raises NotImplementedError. The supported pattern: first call aggregate_datasets() to merge your sources into a single dataset; then, if you want non-uniform weighting (e.g. 70/30 real/sim), wrap a custom Sampler around EpisodeAwareSampler.indices, as sketched below.
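
A sketch of that weighting pattern. aggregate_datasets() and EpisodeAwareSampler come from LeRobot; the WeightedFrameSampler class and the way the real/sim index lists are obtained are assumptions for illustration:

import torch
from torch.utils.data import Sampler

class WeightedFrameSampler(Sampler[int]):
    """Hypothetical sampler: draws frame indices with per-source weights."""
    def __init__(self, real_indices, sim_indices, real_weight=0.7,
                 num_samples=100_000):
        self.indices = list(real_indices) + list(sim_indices)
        # Spread real_weight mass over real frames, the rest over sim frames.
        per_real = real_weight / len(real_indices)
        per_sim = (1 - real_weight) / len(sim_indices)
        self.weights = torch.tensor(
            [per_real] * len(real_indices) + [per_sim] * len(sim_indices),
            dtype=torch.double)
        self.num_samples = num_samples

    def __iter__(self):
        picks = torch.multinomial(self.weights, self.num_samples,
                                  replacement=True)
        return iter(self.indices[i] for i in picks.tolist())

    def __len__(self):
        return self.num_samples

Pass an instance as the sampler= argument to your DataLoader.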

Per-episode stats moved — episodes_stats.jsonl is gone

In v3.0, per-episode stats live as flatten-dict columns (stats/<feature>/<key>) inside the episodes parquet, not in a separate episodes_stats.jsonl. The conversion script handles this for you. If a downstream tool expects the old jsonl, point it at meta/stats.json for aggregate stats.
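
A quick way to peek at those columns, sketched with pandas against the layout shown at the top of this page:

import pandas as pd

# Per-episode metadata, one row per episode, stats inlined as columns.
ep = pd.read_parquet("your-task/meta/episodes/chunk-000/file-000.parquet")
stat_cols = [c for c in ep.columns if c.startswith("stats/")]
print(ep[["episode_index", "length"]].head())
print(stat_cols[:5])  # e.g. stats/<feature>/min, stats/<feature>/mean, ...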

The conversion script is non-destructive

It builds <root>_v30, swaps it via shutil.move, and keeps the original as <root>_old. Verify the new dataset loads, then delete <root>_old when you're confident.

Both 0.4.4 and 0.5.0 read the same v3.0 datasets

You can pin either pip version against the same on-disk dataset. There were no info.json schema changes, no API breaks, and no parquet column changes between the two versions — just a cosmetic pd.Index name change and some Python 3.12 syntax cleanup.

lerobot-edit-dataset for trim / merge / relabel

You don't have to re-record to fix a dataset. Surgical operations are exposed in dataset_tools.py: delete_episodes, split_dataset, merge_datasets, modify_features, add_features, remove_feature, modify_tasks. Each emits a copy at output_dir — your original is never mutated.
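
A sketch of one such operation. The import path and the exact delete_episodes signature are assumptions here; check dataset_tools.py for the real ones:

# Import path and signature assumed; see dataset_tools.py.
from lerobot.datasets.dataset_tools import delete_episodes
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("youruser/pick-cube")
# Drop two bad demos; a fresh copy is written, the original is untouched.
clean = delete_episodes(ds, episode_indices=[3, 17],
                        output_dir="/path/to/pick-cube-clean")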

Optional: under the hood
Show the LeRobotDataset class internals (file:line refs)

Two-class split inside src/lerobot/datasets/lerobot_dataset.py:

  • LeRobotDatasetMetadata (lerobot_dataset.py:86-545) holds info, tasks, subtasks, episodes, stats, and the parquet writer; the .create() classmethod is at lerobot_dataset.py:500.
  • LeRobotDataset(torch.utils.data.Dataset) wraps it for training, starts at lerobot_dataset.py:566.

LeRobotDataset.__init__ (lerobot_dataset.py:567) takes:

repo_id, root, episodes, image_transforms, delta_timestamps,
tolerance_s, revision, force_cache_sync, download_videos,
video_backend, batch_encoding_size, vcodec="libsvtav1",
streaming_encoding=False, encoder_queue_maxsize=30, encoder_threads=None

Construction sequence (lerobot_dataset.py:707-781): build metadata → check version compatibility → try local cache, else snapshot_download → build _absolute_to_relative_idx if episodes is a subset → resolve delta_indices from delta_timestamps.

__getitem__ (lerobot_dataset.py:1082-1118): lazy-load hf_dataset → pull row → if delta_indices is set, _get_query_indices (:972-999) returns clipped indices plus a {key}_is_pad mask → _query_hf_dataset batches non-video keys (:1020-1046) → _query_videos (:1048-1067) shifts timestamps by ep["videos/{vid_key}/from_timestamp"] before decoding (this is why concatenated mp4s look like per-episode mp4s to the reader) → image transforms per camera → task string materialized from task_index.

Episode boundaries. dataset_from_index / dataset_to_index are stored in the per-episode metadata parquet (lerobot_dataset.py:352-354, 389-390) and consumed by EpisodeAwareSampler (sampler.py:42-47).

delta_timestamps is your action-chunking knob — a dict[str, list[float]] like {"action": [0, 0.02, 0.04, …]}. The training factory datasets/factory.py:38-68 builds it from the policy's action_delta_indices / observation_delta_indices / reward_delta_indices divided by fps.
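
A sketch of the knob in action, requesting a 16-step action chunk at 30 fps:

from lerobot.datasets.lerobot_dataset import LeRobotDataset

fps = 30
delta_timestamps = {"action": [i / fps for i in range(16)]}
ds = LeRobotDataset("youruser/pick-cube", delta_timestamps=delta_timestamps)
sample = ds[0]
print(sample["action"].shape)    # (16, action_dim): current step + 15 future
print(sample["action_is_pad"])   # True where the chunk ran past the episode end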

Memory layout. hf_dataset is a pyarrow-memmapped datasets.Dataset over all data/chunk-*/file-*.parquet — no RAM cost beyond the mmap. Tasks: pd.DataFrame indexed by task string. Episodes: a datasets.Dataset from meta/episodes/chunk-*/file-*.parquet with stats/* columns dropped on load.

Default features (auto-injected) at utils.py:73-79: timestamp, frame_index, episode_index, index, task_index.

Show the v2.1 → v3.0 schema migration in detail

Canonical schema-change spec lives in the header of datasets/v30/convert_dataset_v21_to_v30.py:92-133.

v2.1 (legacy) → v3.0 (current), aspect by aspect:

Data parquet
  v2.1: data/chunk-000/episode_000000.parquet       # 1 file per episode
  v3.0: data/chunk-000/file-000.parquet             # multi-episode, size-bounded

Video mp4
  v2.1: videos/chunk-000/CAMERA/episode_000000.mp4  # 1 mp4 per episode per cam
  v3.0: videos/CAMERA/chunk-000/file-000.mp4        # multi-episode, concatenated

Episode metadata
  v2.1: episodes.jsonl                              # row: {episode_index, tasks, length}
  v3.0: meta/episodes/chunk-000/file-000.parquet
        # row: episode_index, tasks, length,
        #      dataset_from_index, dataset_to_index,
        #      data/{chunk_index, file_index},
        #      videos/<cam>/{chunk_index, file_index,
        #                    from_timestamp, to_timestamp},
        #      stats/<feature>/<key> flatten-dict

Tasks
  v2.1: tasks.jsonl                                 # row: {task_index, task}
  v3.0: meta/tasks.parquet                          # index=task string, col=task_index

Per-episode stats
  v2.1: episodes_stats.jsonl                        # separate file
  v3.0: inlined into the episodes parquet under stats/<feature>/<key> columns
        # episodes_stats.jsonl is GONE

Aggregate stats
  v2.1: n/a
  v3.0: meta/stats.json                             # min, max, mean, std, count,
                                                    # q01, q10, q50, q90, q99

info.json
  v2.1: has total_chunks, total_videos
  v3.0: both keys deleted; gains data_files_size_in_mb,
        video_files_size_in_mb, per-feature fps

What conversion does

convert_dataset(...) (datasets/v30/convert_dataset_v21_to_v30.py:451) runs:

  1. validate_local_dataset_version requires v2.1 (:161-169).
  2. convert_info bumps version, deletes total_chunks/total_videos, propagates fps into non-video features (:432-448).
  3. convert_tasks: jsonl → parquet (:172-178).
  4. convert_data: concats with size cap, tracks dataset_from_index / dataset_to_index (:202-250).
  5. convert_videos per camera via concatenate_video_files, tracks from_timestamp / to_timestamp (:303-371).
  6. convert_episodes_metadata folds legacy + chunk indices + stats into a single episodes parquet (:408-429).
  7. Hub side deletes legacy patterns (data/chunk*/episode_*, meta/*.jsonl, videos/chunk*) and retags v3.0 (:510-525).

Conversion appends episodes to growing parquet/mp4 until size hits data_files_size_in_mb (100 MB) or video_files_size_in_mb (200 MB), then opens a new file via update_chunk_file_indices (utils.py:97-103).
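
The rolling rule, as a schematic (the 1000-files-per-chunk value is an assumption; update_chunk_file_indices holds the real logic):

def next_location(current_size_mb, cap_mb, chunk_idx, file_idx,
                  files_per_chunk=1000):  # files_per_chunk value assumed
    """Schematic: append until the size cap, then open a new file,
    spilling into a new chunk once the chunk is full."""
    if current_size_mb < cap_mb:
        return chunk_idx, file_idx        # keep appending to the current file
    if file_idx + 1 < files_per_chunk:
        return chunk_idx, file_idx + 1    # new file, same chunk
    return chunk_idx + 1, 0               # new chunk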

Migration playbook

  1. Pick chunking knobs. Defaults (100 / 200 MB) are fine for sub-minute episodes at 30 fps. For 5-min episodes at 60 fps with 3 RealSense streams, bump video to ~500 MB.
  2. Run the conversion script (non-destructive).
  3. Verify version. Open meta/info.json and confirm codebase_version == "v3.0"; confirm meta/episodes_stats.jsonl is gone.
  4. Smoke-test loading. Instantiate LeRobotDataset in Python and print ds[0].keys() (see the sketch after this list).
  5. Once stable, delete <root>_old. Push to the Hub.
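
Steps 3 and 4 in one snippet; a sketch assuming the converted dataset sits at the original root and that metadata is exposed as ds.meta:

from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("youruser/old-task", root="/path/to/old_dataset")
print(ds.meta.info["codebase_version"])  # expect "v3.0"
print(len(ds), ds[0].keys())             # total frames + per-frame keys
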
Show the recording loop and what gets written to disk

Entry point: record(cfg: RecordConfig) at scripts/lerobot_record.py:434. Flow (scripts/lerobot_record.py:435-602):

  1. make_robot_from_config + optional make_teleoperator_from_config.
  2. Build dataset_features via aggregate_pipeline_dataset_features (:451-464) — camera streams enter the schema here through robot.observation_features keyed observation.images.<cam>.
  3. LeRobotDataset.create(...) with encoder knobs (:490-504), or LeRobotDataset(...) if resuming.
  4. record_loop(...) (:280-432) runs while timestamp < control_time_s — each iteration does obs = robot.get_observation(), runs policy or teleop, robot.send_action(...), builds a frame via build_dataset_frame(dataset.features, …), then dataset.add_frame(frame) (:414); see the sketch after this list.
  5. After the loop, dataset.save_episode() (:581), wrapped in VideoEncodingManager(dataset) (:532) which flushes pending batch encodes on exit.
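
Step 4's inner loop, reduced to a schematic. The objects come from steps 1-3; the timing, the teleop.get_action() call, and the build_dataset_frame argument list are simplifications, since the real signatures live in the files cited above:

import time

# robot, teleop, dataset, task, control_time_s all come from steps 1-3.
t0 = time.perf_counter()
while time.perf_counter() - t0 < control_time_s:
    obs = robot.get_observation()      # follower joint states + camera frames
    action = teleop.get_action()       # leader arm (or a policy) supplies it
    robot.send_action(action)
    frame = build_dataset_frame(dataset.features, obs, action)  # args schematic
    frame["task"] = task               # task string is stored per frame
    dataset.add_frame(frame)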

Camera frames go through _save_image to images/{key}/episode-XXXXXX/frame-XXXXXX.png; the path is stashed in the buffer; PNGs are removed after mp4 encoding (shutil.rmtree(img_dir) in _encode_video_worker at lerobot_dataset.py:562). Task string is per-frame (lerobot_dataset.py:1192) — multi-task within an episode works.

Performance notes

  • Video decode: default torchcodec if installed, else pyav (datasets/video_utils.py:117-124).
  • Video encode: vcodec="libsvtav1" default (lerobot_dataset.py:580, 696-698); streaming_encoding=True skips the PNG round-trip and encodes during add_frame (lerobot_dataset.py:1208-1210) — docstring says "makes save_episode() near-instant". See the sketch after this list.
  • Parallel encode: save_episode runs ProcessPoolExecutor(max_workers=num_cameras) when parallel_encoding=True and num_cameras > 1 (lerobot_dataset.py:1305-1336). Default true on Linux, false on macOS.
  • Worker gotcha: building a DataLoader(num_workers=0) after a DataLoader(num_workers>0) in the same process can segfault from cross-process video-decoder refs (lerobot_dataset.py:1049-1052) — pick one.
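
Wiring those knobs at creation time; a sketch where the exact create() signature is an assumption extrapolated from the __init__ parameter list above:

from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset.create(           # exact create() signature assumed
    repo_id="youruser/pick-cube",
    fps=30,
    features=dataset_features,        # e.g. from aggregate_pipeline_dataset_features
    vcodec="libsvtav1",               # the default encoder
    streaming_encoding=True,          # encode during add_frame; no PNG round-trip
)
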
Where to go next →

Inference & async control — how a trained policy actually drives the robot in real time, and why you should not roll out through lerobot-record.