# Unified Loco-Manipulation Control for the Unitree G1 Humanoid Robot with Vision-Language Model Integration
Research in progress -- paper in preparation. If you use any part of this codebase, architecture, or methodology, citation is required. Contact the author for collaboration inquiries.
A hierarchical reinforcement learning system for whole-body loco-manipulation on the Unitree G1 humanoid robot (29DoF + DEX3 hands, 43 joints total). The robot learns to walk, balance, and reach arbitrary 3D targets through a sequential curriculum trained entirely in simulation using NVIDIA Isaac Lab.
The system uses a "Separate Policy, Separate Obs" architecture (the literature standard: SayCan, Berkeley, SoFTA, Mobile-TeleVision). Each policy has its own observation space, reward function, and PPO update, and previously trained policies are frozen during subsequent training.
## Hierarchical Architecture

```
"Pick up the red cup"
        |
VLM Layer (Florence-2)   -> Target: {position, orientation}          ~1 Hz
        |
Triple Actor-Critic      <- vel_cmd + arm_target + hand_cmd
(separate PPO)           -> Joint actions (43 DoF)                   50 Hz
        |
Unitree G1               12 leg + 3 waist + 14 arm + 14 finger joints
```
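The two-rate structure above (a slow semantic planner feeding fast low-level policies) can be sketched as a nested loop. This is an illustrative stub, not the repository's actual API: `plan_target`, `CONTROL_HZ`, and `PLANNER_EVERY` are made-up names standing in for the Florence-2 call and the 50 Hz policy step.

```python
import math

CONTROL_HZ = 50        # low-level policy rate
PLANNER_EVERY = 50     # one planner refresh per 50 control ticks => ~1 Hz

def plan_target(step: int) -> dict:
    """Stub VLM planner: returns a reach target (torso frame, meters)."""
    return {"position": (0.35, -0.2, 0.1 + 0.05 * math.sin(step / 100.0))}

def run(total_steps: int = 200) -> int:
    """Run the fast loop; refresh the target on the slow schedule."""
    planner_calls = 0
    target = None
    for step in range(total_steps):
        if step % PLANNER_EVERY == 0:   # slow loop (~1 Hz)
            target = plan_target(step)
            planner_calls += 1
        # fast loop (50 Hz): the frozen policies would consume `target` here
        assert target is not None
    return planner_calls

print(run())  # 200 control ticks at 50:1 => 4 planner refreshes
```

The point of the split is that the expensive VLM call never sits on the 50 Hz control path; the policies always act on the most recent cached target.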
### Stage 1: Locomotion

| Metric | Value |
|---|---|
| Obs / Act | 66 / 15 (12 leg + 3 waist) |
| Curriculum | 9 levels (standing -> omnidirectional walk -> push robustness) |
| Velocity range | vx: -0.3 to 1.0 m/s, vy: ±0.4 m/s, vyaw: ±1.0 rad/s |
### Stage 2: Arm Position Reaching

| Metric | Value |
|---|---|
| Obs / Act | 39 / 7 (right arm) |
| Validated reach rate | 86.9% (play, 1 env) |
| Avg reach distance | 3.08 cm (< 4 cm industry target) |
| Avg EE displacement | 21.9 cm |
| Workspace | ~55 cm reach, spherical shell around the shoulder |
| Falls | 0 in 3000 steps |
### Stage 3: Orientation Fine-Tune (failed)

| Metric | Value |
|---|---|
| Orientation error | ~2.18 rad (no improvement over Stage 2) |
| Position | Preserved (4.4 cm), but orientation was not learnable via RL |
| Conclusion | Heuristic wrist control or grasp-policy orientation needed |
Each stage loads the previous checkpoint. Policies are trained sequentially with frozen predecessors.
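The "frozen predecessor" pattern can be sketched in PyTorch: load the prior stage's actor, disable its gradients, and train only the new policy. The `Actor` class, layer sizes beyond the obs/act dimensions, and the commented-out checkpoint path are illustrative, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Minimal MLP actor; hidden size is illustrative."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, act_dim)
        )

    def forward(self, obs):
        return self.net(obs)

loco = Actor(66, 15)                      # Stage 1 policy (pretrained)
# loco.load_state_dict(torch.load(".../model_best.pt"))  # in a real run
loco.eval()
for p in loco.parameters():
    p.requires_grad_(False)               # freeze: excluded from PPO updates

arm = Actor(39, 7)                        # Stage 2 policy (trainable)

frozen = sum(p.numel() for p in loco.parameters() if p.requires_grad)
trainable = sum(p.numel() for p in arm.parameters() if p.requires_grad)
print(frozen, trainable > 0)              # frozen loco contributes 0 trainable params
```

Only `arm`'s parameters would then be handed to the optimizer, so the locomotion behavior is guaranteed not to drift while the arm learns.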
| Stage | Task | Policy | Obs | Act | Status |
|---|---|---|---|---|---|
| 1 | Omnidirectional locomotion | LocoAC | 66 | 15 | Complete |
| 2 | Arm position reaching | ArmAC (loco frozen) | 39 | 7 | Complete |
| 2L | Perturbation-robust locomotion | LocoAC (arm frozen) + perturbations | 66 | 15 | Complete |
| 3L | Variable height / squat | LocoAC (arm frozen) | 66 | 15 | Parked (diminishing returns) |
| 3G | DEX3 finger grasping | GraspAC (fix_root_link) | 45 | 7 | Active (Phase A V3) |
| 4 | Skill chaining | Full pipeline | -- | -- | Planned |
| VLM | Semantic task execution | Florence-2 + skills | -- | -- | Planned |
### Grasp Phase A (active)

Setup: fixed-base robot, 3 object shapes (sphere / cylinder / box, 33% each), finger-only policy.

Key findings:
- Passive grasp exploit: objects fall between the fingers, so the policy learns to "do nothing" (V1-V2)
- Fix: a proximity-gated finger closure reward (`closure * exp(-5*dist)`)
- Reward budget analysis via RewardLogger (TensorBoard `RR/`, `RW/`, `RB/` prefixes)
- Objects must spawn within 5 cm of the palm (any farther and the fingers cannot reach)
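The proximity-gated closure reward mentioned above can be sketched directly from the `closure * exp(-5*dist)` term: closing the fingers only pays off near the object, so the "do nothing and let gravity deliver the object" exploit earns roughly zero. Function and variable names are illustrative.

```python
import math

def closure_reward(closure: float, palm_to_obj_dist: float, k: float = 5.0) -> float:
    """Gate finger-closure credit by palm-object proximity.

    closure: commanded/achieved closure in [0, 1]
    palm_to_obj_dist: distance in meters; k matches the exp(-5*dist) gain
    """
    return closure * math.exp(-k * palm_to_obj_dist)

near = closure_reward(1.0, 0.02)   # 2 cm away: nearly full credit (~0.905)
far = closure_reward(1.0, 0.50)    # 50 cm away: credit ~0.082, exploit starves
print(round(near, 3), round(far, 3))
```

The same gating shape generalizes to other contact rewards: multiply by `exp(-k*dist)` so that reward only flows once the approach phase has actually happened.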
| Policy | Obs -> Act | Architecture | Controls |
|---|---|---|---|
| LocoAC | 66 -> 15 | [512, 256, 128] + LayerNorm + ELU | Legs + waist |
| ArmAC | 39 -> 7 | [256, 256, 128] + ELU | Right arm (7 joints) |
| GraspAC | 45 -> 7 | [256, 128, 64] + ELU | DEX3 right hand (7 finger joints) |
- Each policy has its own observation space (no shared/unified obs)
- Legs: direct position control -- `leg_targets = default_pose + scale * policy_output`
- Arms: residual actions -- `arm_targets = default_arm + scale * policy_output`
- Inference: `full_action = cat(loco_act, arm_act, hand_act)` -> `env.step()`
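The inference-time composition above can be sketched with plain lists (the real pipeline uses `torch.cat` on tensors). Dimensions follow this README's tables; action values, the scale constant, and the default poses are placeholders. The 29 concatenated values cover the actively controlled joints; the remaining joints of the 43-DoF robot (left arm/hand) would presumably stay at default targets.

```python
# Placeholder outputs of the three frozen actors
loco_act = [0.0] * 15   # legs + waist, from LocoAC (obs dim 66)
arm_act = [0.1] * 7     # right arm, from ArmAC (obs dim 39)
hand_act = [0.2] * 7    # DEX3 right-hand fingers, from GraspAC (obs dim 45)

# Residual arm targets around the default pose, as in the bullet above
ARM_SCALE = 0.5                  # illustrative action scale
default_arm = [0.0] * 7
arm_targets = [d + ARM_SCALE * a for d, a in zip(default_arm, arm_act)]

# Concatenate into one command vector for env.step(full_action)
full_action = loco_act + arm_targets + hand_act
print(len(full_action))  # 15 + 7 + 7 = 29 actively controlled joints
```

Keeping the slices in a fixed order (loco, arm, hand) is what lets each policy stay frozen: no policy ever needs to know the others' action dimensions, only its own slice.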
- Simulation: NVIDIA Isaac Lab 2.3.1, Isaac Sim 5.1.0
- RL: Custom PPO (PyTorch), Dual/Triple Actor-Critic
- VLM (planned): Florence-2 / Molmo2
- Robot: Unitree G1 29DoF + DEX3 (43 joints: 12 legs + 3 waist + 14 arms + 14 fingers)
- Hardware: NVIDIA RTX 5070 Ti (12 GB VRAM), Intel i9-13900HX, 32 GB RAM
- Training: 4096 parallel envs, ~17K steps/sec
- Platform: Windows 11 Pro, Python 3.10 (Anaconda env: env_isaaclab)
All commands are run from `C:\IsaacLab` with the conda env `env_isaaclab` activated:

```
cd C:\IsaacLab
conda activate env_isaaclab
```

### Stage 1: Omnidirectional Locomotion
Modes: `--mode walk`, `--mode mixed`, `--mode push`

```
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/play/play_unified_stage_1.py --checkpoint logs/ulc/g1_unified_stage1_2026-02-27_00-05-20/model_best.pt --num_envs 1 --mode mixed
```

### Stage 2: Arm Position Reaching (< 4 cm error, 55 cm reach)
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_unified_stage_2_arm.py --checkpoint logs/ulc/g1_stage2_arm_2026-03-06_18-51-31/model_best.pt --num_envs 1 --mode standing --no_orient
```

### Stage 3: Orientation Fine-Tune (failed -- decommissioned)
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_unified_stage_3_orient.py --checkpoint logs/ulc/g1_stage3_orient_2026-03-09_13-20-39/model_best.pt --num_envs 1 --mode standing
```

## Training

### Stage 1: Locomotion (from scratch)
```
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/train/29dof/train_unified_stage_1.py --num_envs 4096 --max_iterations 50000 --headless
```

### Stage 2: Arm Reaching (from Stage 1 checkpoint)
```
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/train/29dof/train_unified_stage_2_arm.py --stage1_checkpoint logs/ulc/g1_unified_stage1_2026-02-27_00-05-20/model_best.pt --num_envs 4096 --max_iterations 30000 --headless
```

### Stage 3: Orientation Fine-Tune (from Stage 2 checkpoint -- failed experiment)
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\29dof\train_unified_stage_3_orient.py --stage2_checkpoint logs/ulc/g1_stage2_arm_2026-03-06_18-51-31/model_best.pt --orient_weight 2.0 --num_envs 4096 --max_iterations 20000 --headless
```

### Grasp Phase A: Fixed-Base Finger Training (3 shapes: sphere/cylinder/box)
```
# Training (from scratch)
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_grasp_phase_a.py --num_envs 2048 --max_iterations 40000 --headless

# Smoke test (visual)
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_grasp_phase_a.py --num_envs 64 --max_iterations 100
```

### Grasp Phase B: Fixed-Base + Frozen Arm Reaching
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_grasp_phase_b.py --arm_checkpoint logs\ulc\g1_stage2_loco_2026-03-14_21-58-52\model_best.pt --num_envs 2048 --max_iterations 50000 --headless
```

### Loco+Arm Hierarchical Test
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\high_low_hierarchical_g1\scripts\test_hierarchical.py --num_envs 4 --max_steps 3000 --checkpoint C:\IsaacLab\logs\ulc\g1_unified_stage1_2026-02-27_00-05-20\model_best.pt --arm_checkpoint C:\IsaacLab\logs\ulc\ulc_g1_stage7_antigaming_2026-02-06_17-41-47\model_best.pt
```

### VLM Planning Demo
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\high_low_hierarchical_g1\scripts\demo_vlm_planning.py --num_envs 4 --checkpoint C:\IsaacLab\logs\ulc\g1_unified_stage1_2026-02-27_00-05-20\model_best.pt --arm_checkpoint C:\IsaacLab\logs\ulc\ulc_g1_stage7_antigaming_2026-02-06_17-41-47\model_best.pt --task "Pick up the red cup and place it on the second table" --planner simple
```

Monitor any run with TensorBoard:

```
tensorboard --logdir logs/
```

### Legacy Commands
#### Stage 1: Standing
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_1.py --checkpoint logs\ulc\ulc_g1_stage1_2026-01-05_17-27-57\model_best.pt --num_envs 4
```

#### Stage 2: Walking
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_2.py --checkpoint logs\ulc\ulc_g1_stage2_v2_2026-01-08_16-42-40\model_best.pt --num_envs 4 --vx 0.5
```

#### Stage 3: Torso Control
```
# Forward lean
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_3.py --checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 4 --vx 0.0 --pitch -0.35

# Walking + lean
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_3.py --checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 4 --vx 0.3 --pitch -0.2

# Side lean
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_3.py --checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 4 --vx 0.2 --roll 0.15
```

#### Stage 4: Dual Policy Arm Control
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_4_arm_dual.py
```

#### Stage 5: Arm Reaching
```
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/play/play_ulc_stage_5_arm.py --checkpoint logs/ulc/g1_arm_reach_2026-01-22_14-06-41/model_19998.pt --num_envs 1
```

#### Stage 5.5: Combined Loco+Arm
```
$env:PROJECT_ROOT = "C:\unitree_sim_isaaclab"
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_5.5_both.py --loco_checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --arm_checkpoint logs/ulc/g1_arm_reach_2026-01-22_14-06-41/model_19998.pt --num_envs 1 --vx 0.0
```

#### Stage 6: Loco-Manipulation (gaming detected)
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_6_simplified.py --checkpoint logs/ulc/ulc_g1_stage6_simplified_2026-02-04_23-41-18/model_best.pt --num_envs 4 --mode walking
```

#### Stage 7: Anti-Gaming Arm Reaching
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_7.py --checkpoint logs\ulc\ulc_g1_stage7_antigaming_2026-02-06_17-41-47\model_best.pt --num_envs 1 --mode walking
```

#### Paper Video Mode (30 s + 30 s)
```
# Stage 7 paper demo
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_7.py --checkpoint logs\ulc\ulc_g1_stage7_antigaming_2026-02-06_17-41-47\model_best.pt --mode paper

# Stage 6 gaming demo
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\play\play_ulc_stage_6_unified.py --checkpoint logs/ulc/ulc_g1_stage6_complete_2026-01-31_20-49-39/model_final.pt --loco_checkpoint logs/ulc/ulc_g1_stage7_antigaming_2026-02-06_17-41-47/model_best.pt --mode paper
```

#### Legacy Training: Stage 1-3 (Standing -> Walking -> Torso)
```
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc.py --num_envs 4096 --headless --max_iterations 1500
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc_stage_2.py --num_envs 4096 --headless --max_iterations 6000 --stage1_checkpoint logs/ulc/ulc_g1_stage1_2026-01-05_17-27-57/model_best.pt
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc_stage_3.py --stage2_checkpoint logs/ulc/ulc_g1_stage2_v2_2026-01-08_16-42-40/model_best.pt --num_envs 4096 --headless --max_iterations 4000
```

#### Legacy Training: Stage 4-5 (Arm Control)
```
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/train/train_ulc_stage_4_arm.py --num_envs 4096 --max_iterations 5000 --headless
.\isaaclab.bat -p source/isaaclab_tasks/isaaclab_tasks/direct/isaac_g1_ulc/g1/isaac_g1_ulc/train/train_ulc_stage_5_arm_full.py --num_envs 2048 --max_iterations 15000 --headless
```

#### Legacy Training: Stage 6-8 (Loco-Manipulation)
```
$env:PROJECT_ROOT = "C:\unitree_sim_isaaclab"
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc_stage_6_simplified.py --stage3_checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 2048 --max_iterations 30000 --headless
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc_stage_7.py --stage3_checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 2048 --max_iterations 30000 --headless
.\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\isaac_g1_ulc\g1\isaac_g1_ulc\train\train_ulc_stage_8.py --stage3_checkpoint logs/ulc/ulc_g1_stage3_2026-01-09_14-28-58/model_best.pt --num_envs 2048 --max_iterations 30000 --headless
```

| Task | Difficulty | Isaac Lab Feasible | Priority |
|---|---|---|---|
| Pick-place | Easy | Yes | High |
| Drawer opening | Easy | Yes | High |
| Door opening (lever) | Hard | Yes | High |
| Window cleaning | Medium | Yes | Medium |
| Sock inside-out | Medium | Cloth sim needed | Low |
| Pan washing | Hard | Fluid sim needed | Low |
| Key -> Lock | Very Hard | Difficult | Low |
| Peanut butter sandwich | Very Long | No | Low |
Note: even pi-0.5 fails when trained "from scratch" (i.e., without VLM initialization); fine-tuning from a pretrained VLM is required.
```
g1/isaac_g1_ulc/
  config/           Environment and scene configuration (29DoF joints, actuators)
  curriculum/       Sequential curriculum definitions
  envs/             RL environments (ULC, arm reach, dual arm)
  train/
    29dof/                   Active 29DoF training scripts (Stages 1-3, loco/arm)
    train_grasp_phase_a.py   Grasp training: fixed base, finger-only, 3 shapes
    train_grasp_phase_b.py   Grasp training: fixed base + frozen arm reaching
  play/             Evaluation and demo scripts
  rewards/          Modular reward functions
  utils/
    reward_logger.py         Per-component reward breakdown (TensorBoard RR/RW/RB/)
    com_tracker.py           Center-of-mass stability tracking
    quintic_interpolator.py  Smooth trajectory generation
    delay_buffer.py          Action delay simulation
  test/             Kinematics, workspace, joint tests (20+ files)
  demo/             Decoupled walking + reaching demos
  data/             Pre-computed workspace maps
  vlm_integration/  VLM interface (Florence-2)
  agents/           PPO hyperparameter configs
  external/         Hardware integration (DDS, action provider, camera)
```
- Separate policy, separate obs is essential. A unified 188-dim observation (65% zeros) caused LayerNorm pollution and gradient dilution, costing a month of failed V1-V5.2 runs. Each policy must define its own obs space.
- Curriculum gaming is real. Proximity rewards plus smoothness penalties incentivize standing still. Movement-centric rewards (velocity toward the target, progress) and 3-condition reach validation are essential.
- Multi-task needs multi-critic. A single critic receiving mixed signals produces noisy value estimates; separate critics with separate GAE and PPO updates work much better.
- The passive grasp exploit is the #1 grasping failure mode. If objects can fall into the hand via gravity, the policy learns to "do nothing." Fix: proximity-gated rewards (`reward * exp(-k*dist)`), spawning objects within finger reach (5 cm), and ensuring the approach reward dominates the budget.
- RewardLogger is essential for debugging. The TensorBoard `RB/` (reward budget %) view reveals dead rewards (<1%) and dominant rewards (>30%) instantly. Without it, reward design is blind guessing.
- The curriculum gate must check task-specific metrics. A reward threshold alone is insufficient: height tracking needs `h_err < 0.05 m`, grasping needs `grasp_success_rate > 0.3`. Without task gates, the robot advances without learning.
- KL penalty blocks radical behavior change. Fine-tuning with KL=0.02 preserves old behavior; for new tasks (squat, grasp), KL must be 0.005 or lower.
- Use `fix_root_link=True` for grasp training. A free-standing robot with a frozen loco policy requires exact obs/action matching, and any mismatch causes immediate collapse. Keep the root fixed until loco integration is validated separately.
- Training-play consistency matters. Observation thresholds, action scales, and workspace definitions must match exactly between training and evaluation.
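The task-gated curriculum check from the lessons above can be sketched as a level-keyed predicate: advance only when both the reward threshold and the task-specific metric pass. The `h_err < 0.05` and `grasp_success_rate > 0.3` gates come from the bullets; the reward thresholds, function name, and dict layout are illustrative.

```python
def can_advance(level: str, metrics: dict) -> bool:
    """Return True only if reward AND the task-specific gate both pass."""
    gates = {
        # reward thresholds (30.0, 15.0) are illustrative placeholders
        "height_tracking": lambda m: m["mean_reward"] > 30.0 and m["h_err"] < 0.05,
        "grasping": lambda m: m["mean_reward"] > 15.0
                              and m["grasp_success_rate"] > 0.3,
    }
    return gates[level](metrics)

# High reward but the task metric fails -> no advancement (reward gaming caught)
print(can_advance("grasping", {"mean_reward": 25.0, "grasp_success_rate": 0.1}))
# Both pass -> advance
print(can_advance("height_tracking", {"mean_reward": 35.0, "h_err": 0.03}))
```

The key design point is that the gate is a conjunction: a policy that games the reward shaping without actually solving the task can never clear a level.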
| Stage | Checkpoint | Key Metrics |
|---|---|---|
| Stage 1 (Loco) | `g1_unified_stage1_2026-02-27_00-05-20/model_best.pt` | 9-level curriculum complete |
| Stage 2 (Arm) | `g1_stage2_arm_2026-03-06_18-51-31/model_best.pt` | EE=3.08 cm, 86.9% reach rate |
| Stage 2L (Loco robust) | `g1_stage2_loco_2026-03-14_21-58-52/model_best.pt` | R=40.25, L5, 1 kg stable |
| Stage 3L (Squat) | `g1_stage3_loco_squat_2026-03-30_22-12-42/model_31000.pt` | Parked, H=0.763 (target 0.69) |
| Grasp Phase A | `g1_grasp_phase_a_2026-04-03_23-04-01/model_best.pt` | R=19.7, contacts=1.9, 3 shapes |
- Grasp Phase A V3: proximity-gated closure reward, approach weight 8.0, solves the passive-grasp exploit
- Grasp Phase B: Frozen arm policy + finger training (fix_root_link=True)
- Grasp Phase C: Frozen loco + arm + finger (full standing robot)
- Skill Chaining: walk_to -> reach -> grasp -> stand_up -> walk_to -> place
- VLM Planner: SayCan/Berkeley architecture, task decomposition + skill executor
- End-to-end: "Pick up the cup from the table, place it in the box"
- Workshop paper (ICRA/RSS)
References:
- Sun et al., "ULC: Unified Fine-Grained Controller for Humanoid Loco-Manipulation"
- NVIDIA Isaac Lab
- Unitree Robotics, Unitree G1
This is unpublished research work under MIT license.
```bibtex
@misc{yardimci2026g1ulcvlm,
  author = {Yardimci, Mehmet Turan},
  title  = {Hierarchical VLM-ULC for G1 Humanoid Loco-Manipulation},
  year   = {2026},
  note   = {Paper in preparation},
  url    = {https://github.com/mturan33/isaac-g1-ulc-vlm}
}
```

For collaboration or usage inquiries: mehmetturanyardimci@hotmail.com
Mehmet Turan Yardimci
- Cukurova University, Computer Engineering
- GitHub: @mturan33
- LinkedIn: /in/mehmetturanyardimci