# The Lightwheel Simulation Layer

*A design for a simulation-first data infrastructure that closes the sim-to-real gap for embodied AI.*

---

## 0. The problem, stated precisely

Modern robot policies (manipulation, locomotion, whole-body humanoid control) are
data-hungry in the same way LLMs are token-hungry. But unlike text, **robot data does
not exist on the internet.** It has to be *produced*, and the two ways to produce it
both fail at scale:

| Source | Throughput | Cost / hr | Safety | Diversity | Label quality |
|---|---|---|---|---|---|
| Real-world teleop on hardware | 1 robot-hour per robot-hour | high (hardware + operators + facilities) | low (wear, collisions) | bounded by physical setups | manual, noisy |
| Naive game-engine sim | unbounded | ~free | total | unbounded | **wrong physics → wrong labels** |

The naive-sim row is the trap. A policy trained in a simulator whose friction, mass,
compliance, contact dynamics, and optics are wrong will learn a controller for a world
that does not exist. Deployed on hardware it **fails on contact** — the exact moment that
matters in manipulation. This is the **sim-to-real gap**, and it is fundamentally a
*calibration* problem, not a *graphics* problem.

The Lightwheel thesis: **make the simulation correct enough that simulated data is
worth as much as real data, then generate it at internet scale.** Three pillars:

1. **World Generation** — reconstruct real physical spaces as editable, physically-tagged 3D worlds.
2. **Physically-Grounded Physics (SimReady)** — every asset carries *measured* physical parameters; the solver respects them.
3. **Synthetic Data Generation (SDG)** — one calibrated human demonstration is multiplied into thousands of labeled training episodes by randomizing everything that should not matter and keeping invariant everything that should.

This document specifies the layer that makes all three real. The companion artifact —
`kitchen.html` — is a runnable, in-browser instantiation of the whole pipeline on one
scene (a kitchen), so the abstractions below have something concrete to point at.

---

## 1. System overview

```
                          ┌───────────────────────────────────────────────┐
   REAL WORLD             │              LIGHTWHEEL SIM LAYER               │            POLICY
 ┌───────────┐ capture    │                                                 │   episodes  ┌──────────┐
 │ scans,    │──────────▶ │  ① WORLD GEN ─▶ ② SIMREADY ─▶ ③ SDG ─▶ dataset │ ──────────▶ │ training │
 │ measure-  │ measure    │     (USD stage)   (calibrated)   (randomized)   │             │ (Isaac/  │
 │ ments,    │──────────▶ │        ▲              ▲              │          │             │  GR00T)  │
 │ teleop    │            │        │              │              ▼          │             └────┬─────┘
 └───────────┘            │        └────── sim-to-real metrics ◀──┘          │                  │ deploy
        ▲                 │                  (close the loop)                │                  ▼
        └─────────────────┴──────────────── real-robot evals ───────────────┴──────────────  HARDWARE
```

Everything that flows between stages is **OpenUSD**. USD is not a file format here; it is
the *type system* of the platform. A scene, an asset, a robot, a sensor, a randomization
recipe, and a recorded episode are all USD prims with typed attributes and schemas. This
is what lets World Gen, calibration, randomization, and the training stack compose without
glue code.

The runtime substrate is **NVIDIA Omniverse** (USD composition, Hydra/RTX rendering, Fabric
scene description for fast runtime mutation) with **Newton** as the physics engine.

---

## 2. The substrate: why this exact stack

### 2.1 OpenUSD — the scene & asset interchange
- **Composition arcs (sublayers, references, payloads, variants, inherits) are the feature.** A kitchen `references` a counter asset; the counter `references` a SimReady material; a randomization run is a `sublayer` of *opinions* (override friction, swap a variant set) composited over the base stage **non-destructively**. You never mutate the master; you stack opinions. This is exactly the operation SDG needs millions of times.
- **VariantSets** encode discrete diversity: `geometry={clean, chipped}`, `fill={empty, half, full}`, `layout={A,B,C}`. SDG samples variants; training reads the resolved stage.
- **Schemas** (`UsdPhysics`, `UsdGeom`, `UsdLux`, `UsdShade`, plus custom `Lightwheel*` applied schemas) give every physical parameter a *typed, named home* instead of living in a config file beside the mesh.
- **Payloads** keep heavy geometry/splats unloaded until needed → a warehouse with 10⁴ assets is still openable.
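
The "stack opinions, never mutate the master" idea can be illustrated with a toy model. The sketch below is plain Python, not the `pxr` API: the `resolve` helper, the flattened `prim.attribute` keys, and the `variant:fill` key are all hypothetical names chosen for illustration, and real USD composition has far richer strength ordering than a single override pass.

```python
from typing import Any

# Toy model of USD-style opinion stacking (NOT the pxr API): each layer maps
# "primPath.attribute" -> value, and stronger layers override weaker ones.
Layer = dict[str, Any]

def resolve(layers: list[Layer]) -> dict[str, Any]:
    """Compose layers ordered strongest-first; no input layer is mutated."""
    resolved: dict[str, Any] = {}
    for layer in reversed(layers):   # apply the weakest layer first ...
        resolved.update(layer)       # ... so stronger opinions overwrite it
    return resolved

# Base stage (the "master"). Values are illustrative, not measured.
base: Layer = {
    "/Kitchen/Counter.physics:dynamicFriction": 0.60,
    "/Kitchen/Mug.physics:mass": 0.35,
    "/Kitchen/Mug.variant:fill": "empty",
}

# One randomization run is one thin override layer stacked on top.
run_0042: Layer = {
    "/Kitchen/Counter.physics:dynamicFriction": 0.52,
    "/Kitchen/Mug.variant:fill": "half",
}

stage = resolve([run_0042, base])
```

Because each run is only a small dictionary of overrides, generating another variant means authoring another thin layer rather than copying the base stage.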

### 2.2 Omniverse — runtime & rendering
- **Hydra + RTX** provide path-traced and real-time rendering for photoreal RGB, plus ground-truth AOVs at no annotation cost: depth, segmentation (instance/semantic), normals, motion vectors / optical flow, and 2D/3D bounding boxes. Synthetic data ships *with exact labels*, which is the core advantage of synthetic over real.
- **Fabric** (USD-in-memory, SoA) lets us mutate thousands of prim transforms per frame (randomization, physics writeback) without USD authoring overhead.
- **Replicator** is the SDG execution engine: a randomization graph + a writer. We extend it with our calibration constraints (see §4).

### 2.3 Newton — the physics engine
[Newton](https://github.com/newton-physics/newton) (NVIDIA × Google DeepMind × Disney Research, on **NVIDIA Warp**) is chosen because robot learning needs three things at once that older engines give only two of:

1. **Differentiable & GPU-parallel.** Warp compiles Python physics kernels to CUDA. Thousands of environments step in parallel on one GPU — the throughput that makes "thousands of episodes in minutes" literally true, and the gradients that enable sim-tuning and differentiable control.
2. **Multiple solvers under one API** (rigid-body Featherstone, MuJoCo-Warp for contact-rich manipulation, FEM/XPBD for deformables and cloth). Manipulation lives or dies on contact and deformables; a wire harness, a cable, a soft gripper pad, a dish towel all need compliant models, not just rigid boxes.
3. **OpenUSD-native** (Newton consumes `UsdPhysics`), so the same stage that renders also simulates — no lossy export step where parameters drift.

> **Browser stand-in.** `kitchen.html` cannot run Warp/CUDA, so it uses **Rapier** (a Rust rigid-body physics engine compiled to WASM) as a faithful *miniature* of Newton: real rigid-body dynamics, per-material friction & restitution, articulated joints, deterministic stepping. The data model (mass, μ, restitution, compliance carried per asset) is identical to what a Newton-backed stage carries; only the kernel differs. Where the demo fakes something for the browser, it is labeled `// PROXY:` in the source.

### 2.4 The training consumers (out of scope to build, in scope to design for)
The dataset format targets **NVIDIA Isaac Lab** (RL/IL environments) and **GR00T**-style
imitation learning (vision-language-action policies for humanoids). Our episode schema (§6)
is chosen to map onto LeRobot / Isaac Lab dataset loaders.

---

## 3. Pillar ① — World Generation

Goal: turn a real space into an **editable, physically-tagged, simulatable USD stage**, fast.

**Pipeline**
1. **Capture.** Multi-view video / phone scan / pro rig. Optionally a metrology pass (calipers, scale, force gauge, friction sled) on the *objects that will be manipulated* — you do not need to measure the walls.
2. **Reconstruct.** **3D Gaussian Splatting** for appearance (radiance field → photoreal novel views) and **photogrammetry / neural meshing** for geometry. Splats give you the *look* of reality; you cannot simulate contact against a splat, so:
3. **Proxy-fit collision.** Fit collision geometry to the splat/mesh: convex decomposition for rigid props, primitive proxies (box/cylinder/capsule) for structure, signed-distance-field (SDF) colliders for complex statics. Visual ≠ collision is a first-class invariant: high-poly splat for the camera, low-poly convex for the solver.
4. **Semantic segmentation & articulation discovery.** Label prims (counter, sink, cabinet-door, faucet-handle). Detect joints: a cabinet door is a revolute joint, a drawer is prismatic, a faucet is revolute. Author them as `UsdPhysics` joints.
5. **Author USD stage.** Emit a composed stage with geometry, materials, lights (as `UsdLux`, recovered from the capture or relit), cameras, and a clean prim hierarchy.
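
As a minimal sketch of the simplest case in step 3, a box primitive can be fitted to reconstructed points by taking their axis-aligned bounds. The `BoxProxy` type and `fit_box_proxy` function are hypothetical names, and an axis-aligned fit is an assumption made for brevity; a production pipeline would more likely fit oriented boxes or run a convex decomposition.

```python
from dataclasses import dataclass

Point = tuple[float, float, float]

@dataclass(frozen=True)
class BoxProxy:
    """Axis-aligned box collider: the low-poly stand-in the solver sees."""
    center: Point
    half_extents: Point

def fit_box_proxy(points: list[Point], margin: float = 0.0) -> BoxProxy:
    """Fit an axis-aligned box around reconstructed surface points.

    `margin` inflates the box (in the same units as the points) so the
    collider fully encloses slightly noisy geometry.
    """
    if not points:
        raise ValueError("need at least one point to fit a proxy")
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo[i] + hi[i]) / 2.0 for i in range(3))
    half = tuple((hi[i] - lo[i]) / 2.0 + margin for i in range(3))
    return BoxProxy(center, half)

# Illustrative points sampled from a counter-like slab (metres).
counter_points = [(0.0, 0.0, 0.0), (2.0, 0.6, 0.9), (1.0, 0.3, 0.5), (2.0, 0.0, 0.9)]
counter_proxy = fit_box_proxy(counter_points)
```

The visual mesh or splat stays untouched; only the proxy is handed to the physics solver.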

**In the kitchen demo:** the kitchen is *procedurally generated* rather than scanned (no
scanner in a browser), but it is authored to the same target — a clean scene graph
(rendered live as a faux-USD stage tree in the UI), visual meshes separate from convex
collision proxies, lights and cameras as first-class prims, and the faucet/cabinet authored
as articulations. The "Regenerate world" control re-runs the procedural layout the way a
re-scan or a layout variant would.

---

## 4. Pillar ② — Physically-Grounded Physics (SimReady)

**SimReady = a 3D asset that is not just *seen* correctly but *behaves* correctly.** A
SimReady asset is a USD prim carrying, at minimum:

| Category | Attributes (USD) | Why a policy needs it |
|---|---|---|
| Inertial | `mass`, `centerOfMass`, `diagonalInertia` (or density → derived) | grasp force, tipping, swing dynamics |
| Friction | `dynamicFriction`, `staticFriction` (per material, per surface) | slip vs. hold — the #1 manipulation failure |
| Restitution | `restitution` | bounce on placement, settling |
| Compliance | contact stiffness/damping; deformable params (Young's modulus, Poisson) for soft/flexible bodies | cables, towels, soft pads, produce |
| Articulation | joint type, axis, limits, drive stiffness/damping, friction | doors, drawers, faucets, the robot itself |
| Material (visual) | PBR (`UsdShade` MDL/UsdPreviewSurface): albedo, roughness, metalness, transmission | sim-to-real of *perception*, not just dynamics |
| Provenance | source of each value: `measured` / `spec-sheet` / `estimated` / `default` + uncertainty | trust & targeted re-measurement |
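
One way to picture the table as a data structure is a record in which every physical value travels with its provenance and uncertainty. This is a sketch under assumptions: the class names, the `rel_uncertainty` field, and the example numbers are illustrative and are not the `Lightwheel*` schemas themselves.

```python
from dataclasses import dataclass, field
from enum import Enum

class Provenance(Enum):
    MEASURED = "measured"
    SPEC_SHEET = "spec-sheet"
    ESTIMATED = "estimated"
    DEFAULT = "default"

@dataclass(frozen=True)
class Param:
    value: float
    provenance: Provenance
    rel_uncertainty: float  # fraction of the value, e.g. 0.05 means +/-5%

@dataclass
class SimReadyAsset:
    prim_path: str
    material: str
    params: dict[str, Param] = field(default_factory=dict)

    def needs_measurement(self) -> list[str]:
        """Names of non-measured parameters, widest uncertainty first."""
        weak = [(name, p) for name, p in self.params.items()
                if p.provenance is not Provenance.MEASURED]
        weak.sort(key=lambda item: -item[1].rel_uncertainty)
        return [name for name, _ in weak]

# Illustrative values only; not taken from a real measurement campaign.
mug = SimReadyAsset(
    prim_path="/Kitchen/Props/Mug_01",
    material="ceramic",
    params={
        "mass": Param(0.35, Provenance.MEASURED, 0.02),
        "staticFriction": Param(0.60, Provenance.SPEC_SHEET, 0.15),
        "restitution": Param(0.10, Provenance.ESTIMATED, 0.50),
    },
)
```

Here `mug.needs_measurement()` ranks the parameters that would benefit most from a metrology pass, which is the "targeted re-measurement" use of the provenance row.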

**The calibration loop (what makes it "grounded," not "guessed"):**
1. Seed parameters from spec sheets / material databases.
2. **Measure** the high-impact ones on real objects (mass on a scale; μ on an inclined-plane sled; restitution from drop-height/rebound; compliance from a press-and-measure rig).
3. **System-identify** the rest: run the *same* manipulation in sim and on hardware, and use Newton's **differentiability** to backprop the sim-vs-real trajectory error onto physical parameters (`∂error/∂μ`, `∂error/∂stiffness`) until simulated contact matches measured contact.
4. **Tag provenance & uncertainty.** Every value knows where it came from; uncertainty becomes the *range* that domain randomization samples over (§5). Measured → tight range; estimated → wide range. Randomization is therefore not arbitrary — it is bounded by real measurement confidence.
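
Steps 2 and 3 can be sketched on a deliberately tiny problem. The two measurement helpers use the textbook relations μ = tan θ (inclined plane) and e = √(h_rebound / h_drop) (drop test). For system identification, a closed-form sliding-block model stands in for the simulator and its hand-written derivative stands in for Newton's automatic gradients; the "measurements" are generated synthetically from a known μ, so nothing here is real hardware data. All function names and the learning rate are assumptions.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def mu_from_incline(slip_angle_deg: float) -> float:
    """Static friction from the tilt angle at which the sled starts to slide."""
    return math.tan(math.radians(slip_angle_deg))

def restitution_from_drop(drop_height: float, rebound_height: float) -> float:
    """Coefficient of restitution from a drop test (heights in the same unit)."""
    return math.sqrt(rebound_height / drop_height)

def slide_distance(mu: float, v0: float) -> float:
    """Closed-form stand-in for the simulator: stopping distance of a block
    launched at speed v0 on a flat surface with dynamic friction mu."""
    return v0 * v0 / (2.0 * mu * G)

def fit_mu(v0s, measured, mu_init=0.8, lr=0.5, iters=200):
    """Gradient descent on mean squared sim-vs-measured distance error."""
    mu = mu_init
    for _ in range(iters):
        grad = 0.0
        for v0, d_real in zip(v0s, measured):
            d_sim = slide_distance(mu, v0)
            # Chain rule: d(d_sim)/d(mu) = -d_sim / mu
            grad += 2.0 * (d_sim - d_real) * (-d_sim / mu)
        mu -= lr * grad / len(v0s)
    return mu

launch_speeds = [1.0, 1.5, 2.0]  # m/s
# Synthetic stand-in for hardware measurements, generated with mu = 0.4.
synthetic_measurements = [slide_distance(0.4, v) for v in launch_speeds]
mu_fit = fit_mu(launch_speeds, synthetic_measurements)
```

With a differentiable engine the same loop applies to parameters that have no closed form, since the gradient comes from the solver instead of a derivation by hand.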

This is the deep idea: **the calibration uncertainty of an asset *is* its randomization
distribution.** You randomize because you are honest about what you don't know, and you
keep tight what you measured.
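
A minimal sketch of that mapping follows. The half-widths assigned to each provenance tag are placeholder assumptions, as are the function names and the choice of a uniform distribution; in practice the band would come from the recorded measurement uncertainty of each asset.

```python
import random

# ASSUMED band half-widths per provenance tag, as a fraction of the nominal
# value. These numbers are illustrative placeholders, not calibrated values.
BAND_HALF_WIDTH = {
    "measured": 0.05,
    "spec-sheet": 0.15,
    "estimated": 0.30,
    "default": 0.50,
}

def randomization_band(nominal, provenance, rel_uncertainty=None):
    """Return (low, high) for a parameter; an explicit uncertainty wins."""
    half = BAND_HALF_WIDTH[provenance] if rel_uncertainty is None else rel_uncertainty
    return nominal * (1.0 - half), nominal * (1.0 + half)

def sample_param(nominal, provenance, rng, rel_uncertainty=None):
    """Draw one randomized value uniformly from the parameter's band."""
    low, high = randomization_band(nominal, provenance, rel_uncertainty)
    return rng.uniform(low, high)

rng = random.Random(123)  # seeded so a randomization run is reproducible
measured_mu_samples = [sample_param(0.60, "measured", rng) for _ in range(5)]
estimated_mu_samples = [sample_param(0.60, "estimated", rng) for _ in range(5)]
```

The same nominal friction yields a tight spread when it was measured and a wide one when it was only estimated, which is the behavior the paragraph above describes.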

**In the kitchen demo:** every prop (mug, plate, bowl, pot, bottle, apple, knife, sponge,
saucepan…) ships with mass / static & dynamic friction / restitution / material and a
`provenance` tag, surfaced in the **SimReady inspector** (click any object). The browser
solver consumes those exact numbers (friction & restitution are set per-collider; mass via
density). Swapping a mug for a "wet"/low-μ mug visibly changes whether the gripper holds it
— the demo's way of showing that calibration is causal, not cosmetic.
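
The dry-versus-wet mug behavior follows from a simple static check. For a parallel-jaw gripper squeezing with normal force N on two contact pads, friction can carry at most 2·μ·N, so the object is held only if 2·μ·N ≥ m·g. The sketch below assumes a vertical, non-accelerating hold, and the mass, friction, and grip-force values are illustrative rather than the demo's actual catalog numbers.

```python
G = 9.81  # gravitational acceleration, m/s^2

def min_grip_force(mass_kg: float, mu_static: float) -> float:
    """Smallest normal force per pad (N) that holds the object statically."""
    return mass_kg * G / (2.0 * mu_static)

def holds(mass_kg: float, mu_static: float, grip_force_n: float) -> bool:
    """True if two-pad friction capacity covers the object's weight."""
    return 2.0 * mu_static * grip_force_n >= mass_kg * G

MUG_MASS = 0.35    # kg, illustrative
GRIP_FORCE = 5.0   # N per pad, illustrative

dry_ok = holds(MUG_MASS, mu_static=0.60, grip_force_n=GRIP_FORCE)
wet_ok = holds(MUG_MASS, mu_static=0.20, grip_force_n=GRIP_FORCE)
wet_required = min_grip_force(MUG_MASS, mu_static=0.20)
```

Under these assumed numbers the same grip command succeeds on the dry mug and fails on the wet one, so a policy trained with the wrong μ learns the wrong grip force.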

---

## 5. Pillar ③ — Synthetic Data Generation (SDG)

The multiplier. Architecture: **one calibrated human demonstration → a randomization graph
→ N labeled episodes.**

**5.1 Demonstrate (teleoperation).** A human performs the task once in the *calibrated*
sim via VR/teleop (Apple Vision Pro or OpenXR hand tracking, 6-DoF controllers, a SpaceMouse,
or a leader arm). Because the physics are calibrated, the demonstration's contact events are
physically representative contact events. We record the full episode (§6). Optionally augment
with a few real-robot teleop episodes as an anchor.

**5.2 Multiply (domain randomization).** Replay/retarget the demonstrated *intent* across a
randomization graph that perturbs everything that *should not* change the correct action,
while preserving task semantics:

- **Visual / appearance** — light color-temperature, intensity, direction & count; PBR material params; textures; camera pose/intrinsics/exposure; post (noise, motion blur). Trains perceptual invariance. *Cheapest, highest yield.*
- **Physical** — friction, mass, restitution, compliance **sampled within each asset's calibrated uncertainty band** (§4). Trains dynamical robustness; this is what closes the contact gap.
- **Layout / semantic** — object poses, distractor objects, clutter, occlusion, container fill level, articulation start state (door ajar?). Uses USD VariantSets + pose sampling. Trains generalization across scenes.
- **Task-preserving retarget** — re-solve the manipulation (IK + motion plan + closed-loop controller) for each new layout so the *action labels stay correct* after randomization. Randomizing the scene without re-solving the action produces mislabeled data; this step is what separates real SDG from screenshot augmentation.
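
The retarget step can be sketched with the simplest possible solver: analytic inverse kinematics for a planar 2-link arm, re-solved for each randomized object position so the joint targets stay consistent with the scene. Link lengths, function names, and the sampled positions are assumptions; a production retarget would add full-DoF IK, motion planning, and a closed-loop controller as the bullet describes.

```python
import math

def ik_2link(x, y, l1, l2, elbow_sign=1.0):
    """Analytic IK for a planar 2-link arm; returns (shoulder, elbow) in radians.

    elbow_sign selects one of the two mirror-image solutions (+1 or -1).
    """
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0 + 1e-9:
        raise ValueError("target out of reach")
    c2 = max(-1.0, min(1.0, c2))  # clamp rounding error at the reach boundary
    s2 = elbow_sign * math.sqrt(1.0 - c2 * c2)
    q2 = math.atan2(s2, c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
    return q1, q2

def fk_2link(q1, q2, l1, l2):
    """Forward kinematics: end-effector position for the given joint angles."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

L1, L2 = 0.40, 0.30  # link lengths in metres, illustrative

# Randomized mug positions from a hypothetical layout sampler.
randomized_targets = [(0.50, 0.20), (0.30, -0.10), (0.20, 0.45)]

# Re-solve the action for every layout instead of reusing the demo's joints.
retargeted = [ik_2link(x, y, L1, L2) for x, y in randomized_targets]
```

Running `fk_2link` on each solution reproduces the randomized target, which is the property that keeps the action label correct after the scene changes.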

**5.3 Curate.** Filter episodes by success + physical plausibility (no interpenetration,
no NaN, energy sane), balance the distribution, dedupe. **Quality of the multiplier matters
more than raw count** — 1k diverse *correct* episodes beat 100k near-duplicate ones.
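
A curation pass along these lines might look like the sketch below. It reads the fields of the §6 schema (`task.success`, `frames[].joint_pos`, `frames[].objects`), but `max_penetration` is a hypothetical per-frame field, the thresholds are assumptions, and the energy check and distribution balancing are omitted for brevity.

```python
import math

def is_plausible(ep, max_penetration_m=1e-3):
    """Keep only successful episodes with finite states and no deep overlap."""
    if not ep["task"]["success"]:
        return False
    for frame in ep["frames"]:
        values = frame["joint_pos"] + frame["joint_vel"]
        if not all(math.isfinite(v) for v in values):
            return False  # NaN or inf means the solver blew up
        # 'max_penetration' is a HYPOTHETICAL field: deepest overlap in metres.
        if frame.get("max_penetration", 0.0) > max_penetration_m:
            return False
    return True

def layout_key(ep, grid_m=0.02):
    """Quantize first-frame object positions so near-identical layouts collide."""
    first = ep["frames"][0]
    return tuple(sorted(
        (obj["id"], tuple(round(c / grid_m) for c in obj["pos"]))
        for obj in first["objects"]
    ))

def curate(episodes):
    """Filter implausible episodes, then drop near-duplicate layouts per task."""
    kept, seen = [], set()
    for ep in episodes:
        if not is_plausible(ep):
            continue
        key = (ep["task"]["id"], layout_key(ep))
        if key in seen:
            continue
        seen.add(key)
        kept.append(ep)
    return kept
```

The grid size controls how aggressive deduplication is: a coarser grid treats more layouts as the same and keeps fewer episodes.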

**5.4 Render & label.** Omniverse/Replicator renders RGB for every camera and emits
ground-truth AOVs (depth, instance/semantic seg, 2D/3D boxes, normals, flow). The labels are
exact by construction and need no human annotation, which is the main reason to prefer
synthetic over real data.
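
To illustrate why such labels need no annotation, note that a tight 2D bounding box is a pure function of the instance-segmentation mask the renderer already produces. The helper below is a hypothetical sketch over a nested-list mask; it is not Replicator's annotator API, which emits boxes directly.

```python
def boxes_from_instance_mask(mask):
    """Derive tight 2D boxes from an instance mask.

    mask: 2D list of ints where 0 is background and any other value is an
    instance id. Returns {id: (xmin, ymin, xmax, ymax)} in inclusive pixels.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, inst in enumerate(row):
            if inst == 0:
                continue
            if inst not in boxes:
                boxes[inst] = [x, y, x, y]
            else:
                b = boxes[inst]
                b[0] = min(b[0], x)
                b[1] = min(b[1], y)
                b[2] = max(b[2], x)
                b[3] = max(b[3], y)
    return {inst: tuple(b) for inst, b in boxes.items()}

# Tiny illustrative mask: instance 1 is a 2x2 blob, instance 2 a 1x2 column.
example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
example_boxes = boxes_from_instance_mask(example_mask)
```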

**In the kitchen demo:** the **Synthetic Data** panel does all of this in miniature — a live
teleop/autopilot demonstration is recorded; "Generate variant" re-randomizes lights /
materials / layout / physics-within-band and **re-solves the pick-and-place** so the label
stays correct; "Multiply ×N" fans this out and stacks a filmstrip of rendered RGB
thumbnails with a running episode counter; "Export" downloads a dataset (episodes + manifest)
in the §6 schema.

---

## 6. Data contract (the episode schema)

One episode = one task execution, multi-modal, fully labeled. JSON/JSONL for the demo;
USD + Parquet/MCAP for production. Chosen to map onto LeRobot / Isaac Lab loaders.

```jsonc
{
  "episode_id": "kitchen-000123",
  "task": { "id": "put_mug_in_sink", "language": "put the mug in the sink", "success": true },
  "scene": {
    "usd_stage": "kitchen.usda",          // the composed, resolved stage
    "randomization": { "seed": 123, "lighting": {...}, "materials": {...}, "layout": {...}, "physics": {...} }
  },
  "robot": { "name": "arm6", "dof": 6 },
  "fps": 30,
  "frames": [
    {
      "t": 0.0333,
      "joint_pos": [...], "joint_vel": [...],
      "ee_pose": { "pos": [x,y,z], "quat": [x,y,z,w] },
      "gripper": 0.0,                       // 0 open … 1 closed
      "action": { "ee_delta": [...], "gripper_cmd": 0 },   // policy-target labels
      "objects": [ { "id": "mug_01", "pos": [...], "quat": [...] } ],
      "contacts": [ { "a": "gripper", "b": "mug_01", "force": 3.2 } ],
      "cameras": { "main": "frame_main_0001.png", "wrist": "frame_wrist_0001.png" }
      // + depth/seg/boxes AOVs in production
    }
  ],
  "provenance": { "sim": "newton", "calibrated": true, "operator": "teleop|autopilot" }
}
```

**Invariant:** every numeric label (action, contact force, object pose) is produced by the
*calibrated* physics, so the label is trustworthy. That single invariant is the product.
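
A loader would typically check structural consistency before trusting an episode. The validator below is a sketch, not part of the schema: the key names come from the example above, while the specific checks (joint arrays matching `robot.dof`, gripper in [0, 1], strictly increasing timestamps) are assumptions about what a consumer would want enforced.

```python
TOP_LEVEL_KEYS = ("episode_id", "task", "scene", "robot", "fps", "frames", "provenance")
FRAME_KEYS = {"t", "joint_pos", "joint_vel", "ee_pose", "gripper",
              "action", "objects", "contacts", "cameras"}

def validate_episode(ep: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing top-level key: {k}" for k in TOP_LEVEL_KEYS if k not in ep]
    if errors:
        return errors
    dof = ep["robot"]["dof"]
    prev_t = None
    for i, frame in enumerate(ep["frames"]):
        missing = FRAME_KEYS - frame.keys()
        if missing:
            errors.append(f"frame {i}: missing {sorted(missing)}")
            continue
        for name in ("joint_pos", "joint_vel"):
            if len(frame[name]) != dof:
                errors.append(f"frame {i}: {name} has {len(frame[name])} values, expected {dof}")
        pose = frame["ee_pose"]
        if len(pose["pos"]) != 3 or len(pose["quat"]) != 4:
            errors.append(f"frame {i}: ee_pose needs 3 position and 4 quaternion values")
        if not 0.0 <= frame["gripper"] <= 1.0:
            errors.append(f"frame {i}: gripper must lie in [0, 1]")
        if prev_t is not None and frame["t"] <= prev_t:
            errors.append(f"frame {i}: timestamps must strictly increase")
        prev_t = frame["t"]
    return errors
```

Returning a list of messages instead of raising on the first problem makes it easier to report every defect in an exported dataset in one pass.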

---

## 7. Closing the loop — sim-to-real metrics

A simulation platform without a falsifiable fidelity metric is a graphics demo. We track:

- **Trajectory divergence** — replay a sim action open-loop on hardware; measure EE/object pose drift over time. Drives recalibration (§4, calibration-loop step 3).
- **Contact-event match** — do contacts begin/end at the same poses & forces in sim vs. real? Manipulation-specific and the most diagnostic.
- **Policy transfer Δ** — success rate of the *same* policy in sim vs. on hardware. The end-to-end number that matters; the gap is the sim-to-real gap, quantified.
- **DR sensitivity** — success vs. randomization breadth → find the band that maximizes real transfer without destroying learnability (too-wide DR is as bad as too-narrow).

These metrics feed back into asset calibration and randomization-band tuning. The platform
is a *control loop on its own fidelity*, not a one-shot exporter.
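
Two of these metrics reduce to a few lines of arithmetic. The sketch below assumes time-aligned position samples from sim and hardware and simple success counts; the function names are hypothetical and any numbers passed in would come from an actual evaluation run, not from this document.

```python
import math

def trajectory_divergence(sim_positions, real_positions):
    """Per-step Euclidean distance between time-aligned sim and real positions."""
    if len(sim_positions) != len(real_positions):
        raise ValueError("trajectories must be time-aligned and equal length")
    return [math.dist(s, r) for s, r in zip(sim_positions, real_positions)]

def policy_transfer_gap(sim_successes, sim_trials, real_successes, real_trials):
    """Sim success rate minus hardware success rate for the same policy."""
    if sim_trials <= 0 or real_trials <= 0:
        raise ValueError("trial counts must be positive")
    return sim_successes / sim_trials - real_successes / real_trials
```

A drift curve that grows steadily from the first contact onward points at contact parameters (friction, compliance), while a gap that appears before any contact points at kinematics or actuation.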

---

## 8. How the kitchen demo maps to the platform

| Platform concept | Production tech | In `kitchen.html` |
|---|---|---|
| Scene interchange | OpenUSD stage | live faux-USD stage tree (UI) + structured scene graph in `world.js` |
| Runtime / render | Omniverse Hydra + RTX | Three.js (PBR, ACES, IBL, soft shadows, procedural textures) |
| Physics | Newton on Warp (GPU) | Rapier (WASM) — real rigid-body, per-material μ/restitution, articulations |
| SimReady asset | USD prim + physics schema + provenance | `assets.js` catalog: geometry + mass/μ/restitution/material/provenance |
| World Gen | Splatting + photogrammetry → USD | procedural kitchen generator + visual/collision split |
| Teleop | Vision Pro / OpenXR | drag-to-move EE target + grip toggle (`teleop.js`) |
| Autopilot retarget | IK + motion plan + controller | analytic base-yaw + 2-link IK + grasp state machine (`robot.js`, `autopilot.js`) |
| SDG | Replicator randomization graph | `randomize.js` (lights/materials/layout/physics-in-band) + re-solve |
| Dataset | USD + Parquet/MCAP | episode JSON export (`recorder.js`), §6 schema |
| Sensors | RTX AOVs (RGB-D, seg, boxes) | multi-camera RGB (main / wrist / overhead) thumbnails |
| Sim-to-real metrics | hardware eval loop | success-rate + contact readout in HUD (sim-only, labeled PROXY) |

The demo is deliberately *honest about its proxies* (search `PROXY:` in source) so it reads
as a faithful scale model of the real layer, not a misrepresentation of it.

---

## 9. Build order (what ships, in what sequence)

1. **Substrate** — Rapier world + fixed-step loop + Three.js renderer with IBL/shadows. *(physics.js, main.js)*
2. **World** — procedural kitchen, visual/collision split, lights & cameras as prims. *(world.js, textures.js)*
3. **SimReady assets** — calibrated prop catalog + spawner + inspector. *(assets.js, ui.js)*
4. **Robot** — arm rig, base-yaw + 2-link IK, gripper, grasp/release via kinematic attach. *(robot.js)*
5. **Demonstrate** — teleop controls + autopilot pick-and-place state machine. *(teleop.js, autopilot.js)*
6. **SDG** — domain randomization (4 axes) + re-solve + filmstrip + episode counter. *(randomize.js)*
7. **Data** — trajectory recorder + dataset/manifest export in §6 schema. *(recorder.js)*
8. **Shell** — HUD, USD stage tree, panels, controls, the product/landing page. *(ui.js, index.html)*

Each stage is independently demonstrable; together they are the pipeline of §1.
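
The fixed-step loop named in step 1 is usually built with a time accumulator, so that physics always advances in equal increments regardless of render frame rate. The sketch below shows the pattern in Python for consistency with the other examples in this document; the demo itself is browser code, and the function name, default step size, and backlog cap are assumptions.

```python
def advance(accumulator, frame_dt, step_fn, fixed_dt=1.0 / 60.0, max_steps=5):
    """Consume elapsed frame time in fixed physics steps.

    Returns (leftover_time, steps_taken). step_fn(fixed_dt) is called once
    per physics step. max_steps caps the work done after a long stall.
    """
    accumulator += frame_dt
    steps = 0
    while accumulator >= fixed_dt and steps < max_steps:
        step_fn(fixed_dt)
        accumulator -= fixed_dt
        steps += 1
    if steps == max_steps and accumulator >= fixed_dt:
        accumulator = 0.0  # drop the backlog rather than spiral behind
    return accumulator, steps
```

Stepping with a constant `fixed_dt` is what makes a recorded episode replayable: the same inputs produce the same sequence of solver steps regardless of how fast frames were rendered.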

---

*Newton & Warp, OpenUSD, Omniverse, Isaac Lab, and GR00T are the real NVIDIA-ecosystem
infrastructure this layer is designed for; Rapier and Three.js are the browser stand-ins
chosen so the entire pipeline is runnable on one machine with no install. The mapping in §8
is the contract between the two.*
