Preview release/May 27, 2026

RoboInF: Scaling Robot Manipulation Data in Simulation for General Instruction Following

Diverse instruction-following data for robot manipulation at scale: RoboInF automatically generates realistic scenes, natural task instructions, and verified trajectories across objects, actions, and environments.

Author: XLANG Lab
Date: May 27, 2026
  • 1M+ successful trajectories: Verified rollouts across diverse tasks and environments.
  • 5K+ tasks: Diverse ability distribution for tabletop manipulation learning.
  • 300+ scenes: Randomized tabletop worlds with realistic clutter and variation.
  • 50+ reward primitives: Compose for unlimited reward and evaluation logic.

Generated trajectory samples

Fully automatic robot data, zero manual annotation

These clips require no human teleoperation or manual annotation. RoboInF generates the scenes, instructions, rewards, motion programs, and successful rollouts end to end.

The tasks go beyond scaling simple pick-and-place: they span two axes, from short-horizon to long-horizon tasks and from rigid-body manipulation to articulation.

*Organized scenes are agentically reconstructed from real-world reference images into simulation (see Scene Generation). Random scenes sample objects and layouts for broad combinatorial coverage. Videos are shown at 1.5× speed.

Why RoboInF

If you are training a vision-language-action model today, your data options are limited. Real robot teleoperation produces high-quality trajectories but scales slowly and covers narrow task distributions. Internet videos are abundant but lack ground-truth actions, and bridging the embodiment gap remains an open problem [1][2][3][4]. The result is a practical bottleneck: generalist manipulation policies need data that is simultaneously diverse in scenes, natural in language, spatially precise, and physically varied -- and most existing pipelines deliver only one or two of those properties at a time.

Modern VLA models have shown increasingly impressive long-horizon behavior, from household tasks to cooking-style demonstrations [5][6]. Those demonstrations make the data problem more urgent, not less. Generalist manipulation needs training data that covers richer scenes, natural language variation, fine-grained spatial control, and perturbations that do not appear in narrow benchmark distributions.

Recent systems such as GenSim2, RoboTwin, InternData-A1, and MolmoBot show that scalable robot data is becoming a central path toward general-purpose manipulation [7][8][9][10]. RoboInF addresses the data bottleneck described above by coupling five generation stages into a single pipeline: scene construction, task proposal, reward synthesis, motion-code generation with simulator feedback, and domain-randomized rollout with automatic success filtering. Every retained trajectory has been verified against a generated reward function before it enters the training set.
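To make the idea of reward-verified filtering concrete, here is a minimal sketch of how small reward primitives could be composed into a single success check and used to discard failed rollouts. This is an illustration, not RoboInF's actual code: the state format, the primitive names (`is_on`, `in_region`, `all_of`), and the tolerances are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical final-state format: object name -> (x, y, z) position in metres.
State = Dict[str, Tuple[float, float, float]]
Reward = Callable[[State], bool]


def is_on(top: str, bottom: str, xy_tol: float = 0.05) -> Reward:
    """Primitive: `top` is above `bottom` and horizontally aligned with it."""
    def check(state: State) -> bool:
        tx, ty, tz = state[top]
        bx, by, bz = state[bottom]
        return abs(tx - bx) <= xy_tol and abs(ty - by) <= xy_tol and tz > bz
    return check


def in_region(obj: str, x_range: Tuple[float, float], y_range: Tuple[float, float]) -> Reward:
    """Primitive: `obj` lies inside an axis-aligned rectangle on the table."""
    def check(state: State) -> bool:
        x, y, _z = state[obj]
        return x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
    return check


def all_of(*rewards: Reward) -> Reward:
    """Compose primitives: succeed only if every primitive succeeds."""
    return lambda state: all(r(state) for r in rewards)


@dataclass
class Rollout:
    instruction: str
    final_state: State


def filter_successful(rollouts: List[Rollout], evaluate: Reward) -> List[Rollout]:
    """Keep only rollouts whose final state passes the reward check."""
    return [r for r in rollouts if evaluate(r.final_state)]


evaluate = all_of(
    is_on("cup", "plate"),
    in_region("plate", (0.0, 0.6), (-0.3, 0.3)),
)
rollouts = [
    Rollout("put the cup on the plate", {"cup": (0.30, 0.10, 0.03), "plate": (0.31, 0.10, 0.01)}),
    Rollout("put the cup on the plate", {"cup": (0.10, 0.40, 0.02), "plate": (0.31, 0.10, 0.01)}),
]
kept = filter_successful(rollouts, evaluate)
print(len(kept))  # 1: only the first rollout ends with the cup over the plate
```

Keeping each primitive as a plain function from state to boolean means new checks can be combined without changing the filtering code. A real pipeline would evaluate against simulator state rather than a hand-written dictionary.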

Pipeline overview

Five generation stages, one training-data output

Each stage produces an artifact that makes the next stage more reliable. The endpoint is not a finished model in this preview; it is filtered VLA supervision containing instructions, observations, actions, and task-success metadata.

  1. Scene: randomized tabletop world
  2. Task: scene-conditioned instruction
  3. Reward: executable evaluate()
  4. Program: motion-planning code
  5. Rollout: successful trajectory
  6. Output: VLA training record
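The stage list above can be read as a chain of functions, each consuming the previous stage's artifact. The sketch below shows one way such a chain might be wired together. The `Stages` and `VLARecord` types, their field names, and the toy stage implementations are hypothetical stand-ins, not the released RoboInF interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class VLARecord:
    """Hypothetical training record: instruction, observations, actions, success metadata."""
    instruction: str
    observations: List[Any]
    actions: List[Any]
    success: bool
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Stages:
    """One callable per pipeline stage (all signatures are assumptions)."""
    build_scene: Callable[[int], Any]                                 # 01 seed -> scene
    propose_task: Callable[[Any], str]                                # 02 scene -> instruction
    synthesize_reward: Callable[[Any, str], Callable[[Any], bool]]    # 03 -> evaluate()
    write_program: Callable[[Any, str], Any]                          # 04 -> motion program
    rollout: Callable[[Any, Any], Tuple[List[Any], List[Any], Any]]   # 05 -> (obs, actions, final state)


def generate_record(stages: Stages, seed: int) -> Optional[VLARecord]:
    """Run the five stages once; return a record only if the reward check passes."""
    scene = stages.build_scene(seed)
    instruction = stages.propose_task(scene)
    evaluate = stages.synthesize_reward(scene, instruction)
    program = stages.write_program(scene, instruction)
    observations, actions, final_state = stages.rollout(scene, program)
    if not evaluate(final_state):
        return None  # failed rollouts never reach the training set
    return VLARecord(instruction, observations, actions, True, {"seed": seed})


# Toy stages so the sketch runs end to end; even seeds "succeed", odd seeds "fail".
toy = Stages(
    build_scene=lambda seed: {"seed": seed, "objects": ["cup", "plate"]},
    propose_task=lambda scene: "put the cup on the plate",
    synthesize_reward=lambda scene, instruction: (lambda final: final["cup_on_plate"]),
    write_program=lambda scene, instruction: ["grasp cup", "place on plate"],
    rollout=lambda scene, program: (
        ["obs0", "obs1"],
        list(program),
        {"cup_on_plate": scene["seed"] % 2 == 0},
    ),
)

records = [r for r in (generate_record(toy, s) for s in range(4)) if r is not None]
print(len(records))  # 2
```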

The five-stage generation loop

Diverse worlds

Scene Generation - Building Diverse Robot Manipulation Worlds

What this stage solves. Robust policies need more than clean tabletop scenes. They need clutter, realistic object co-occurrence, spatial variation, camera changes, lighting changes, and physical diversity.

How it works. RoboInF uses two complementary scene-generation modes. Random synthesis provides broad combinatorial coverage by sampling everyday objects, converting them into simulation-ready assets, rescaling them to plausible physical sizes, and placing them in physics-valid tabletop layouts. Image-conditioned agentic generation helps reconstruct more natural arrangements from reference images, including object and spatial manifests for kitchen-style or household-style scenes.
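As a rough illustration of the random-synthesis mode, the sketch below places objects on a tabletop by rejection sampling, treating each object as a circle and rejecting positions that overlap earlier placements. This is a simplified geometric stand-in: the circular footprints, table dimensions, and the `place_objects` helper are assumptions, and a real pipeline would validate layouts in the physics simulator.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Placement = Tuple[str, float, float, float]  # (name, x, y, radius), all in metres


def place_objects(
    objects: Sequence[Tuple[str, float]],  # (name, footprint radius)
    table_w: float,
    table_d: float,
    rng: random.Random,
    max_tries: int = 200,
) -> Optional[List[Placement]]:
    """Sample non-overlapping positions fully inside the table, or return None."""
    placed: List[Placement] = []
    for name, radius in objects:
        for _attempt in range(max_tries):
            # Keep the whole footprint on the table surface.
            x = rng.uniform(radius, table_w - radius)
            y = rng.uniform(radius, table_d - radius)
            # Accept only if the footprint clears every earlier object.
            if all(math.hypot(x - px, y - py) >= radius + pr for _n, px, py, pr in placed):
                placed.append((name, x, y, radius))
                break
        else:
            return None  # could not fit this object; the caller may resample the scene
    return placed


rng = random.Random(0)
layout = place_objects([("mug", 0.05), ("bowl", 0.08), ("book", 0.12)], 1.0, 0.6, rng)
for name, x, y, radius in layout or []:
    print(f"{name}: x={x:.2f} y={y:.2f} r={radius:.2f}")
```

Returning `None` instead of forcing a placement lets the caller discard crowded scenes and sample again, which keeps every accepted layout collision-free by construction.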

Across both modes, RoboInF randomizes object poses, camera views, robot initial states, textures, backgrounds, lighting, and physics parameters. The intended distribution is not one perfect simulated world, but many plausible worlds that expose policies to natural visual and physical variation.
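Domain randomization of this kind is often implemented by drawing one parameter set per rollout from fixed ranges. The sketch below shows that pattern under stated assumptions: the field names and numeric ranges are illustrative placeholders, not values used by RoboInF.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Randomization:
    """One sampled set of visual and physical parameters (hypothetical fields)."""
    camera_yaw_deg: float
    camera_height_m: float
    light_intensity: float
    friction: float
    mass_scale: float
    texture_id: int


def sample_randomization(rng: random.Random) -> Randomization:
    """Draw each parameter independently from an assumed range."""
    return Randomization(
        camera_yaw_deg=rng.uniform(-15.0, 15.0),
        camera_height_m=rng.uniform(0.8, 1.2),
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.0),
        mass_scale=rng.uniform(0.8, 1.25),
        texture_id=rng.randrange(100),  # index into an assumed pool of 100 textures
    )


# Seeding the generator makes every randomized world reproducible from its seed.
worlds = [sample_randomization(random.Random(seed)) for seed in range(3)]
for world in worlds:
    print(world)
```

Tying each sample to a seed means a failed or surprising rollout can be regenerated exactly, which helps when inspecting why a trajectory was filtered out.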

Generated simulation scenes
Generated tabletop scenes from the scene generation stage.

Early model observations

We have begun training VLA models on a subset of the generated data. We are not reporting quantitative results in this preview because we want the first published numbers to come with a reproducible benchmark and ablation study rather than preliminary snapshots.

Qualitatively, models trained with RoboInF data handle perturbations (distractor objects, changed lighting, shifted camera poses) more reliably than our internal baselines, and they follow compositional instructions more consistently. We have also seen early signs of zero-shot sim-to-real transfer, which we are working to characterize rigorously.

For this preview, the main contribution is the data engine itself: a way to automatically generate diverse, realistic, controllable, and verifiable manipulation experience at scale. RoboInF is our first step toward scalable robot data generation that is both broad and inspectable, and we are continuing to expand the pipeline across embodiments, richer physical settings, and stronger mixtures of synthetic and real-world data.

Full results, training recipes, and benchmark details will accompany the data and code release.

Current Scope and Next Steps

This preview focuses on scalable, controllable data generation. The same pipeline extends naturally to the physical settings, embodiments, and data mixtures we are expanding toward next.

| Today | Next | Why it matters |
| --- | --- | --- |
| Single-arm manipulation | Dual-arm and broader embodiments | More realistic household and industrial manipulation. |
| Rigid objects | Soft objects, liquids, and deformables | Clothes, food, packaging, and other everyday materials. |
| Reward-filtered SFT data | RL fine-tuning and multi-task co-training with the same rewards | Move from imitation to optimization without writing new reward functions. |
| Coarse-grained manipulation tasks | More intricate, fine-grained, and difficult task programs | Stress-test compositional rewards, contact-rich motion code, and long-horizon verification. |

Citation

If you find this blog post or its content helpful, please cite:

@article{roboinf,
  title = {RoboInF: Scaling Robot Manipulation Data in Simulation for General Instruction Following},
  author = {XLANG Lab},
  journal = {xlang.ai},
  year = {2026},
  month = {May},
  url = "https://xlang.ai/blog/roboinf"
}

References

[1] Chi, C., Xu, Z., Pan, C., Cousineau, E., Burchfiel, B., Feng, S., Tedrake, R., & Song, S. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv:2402.10329, 2024.
[2] Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Lin, Y., & others. Latent Action Pretraining from Videos. arXiv:2410.11758, 2024.
[3] Luo, H., Feng, Y., Zhang, W., Zheng, S., Wang, Y., Yuan, H., Liu, J., Xu, C., Jin, Q., & Lu, Z. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos. arXiv:2507.15597, 2025.
[4] Dai, W., Lan, K., Zhou, J., Zhao, B., Su, X., Tong, J., Guan, W., & Yang, S. ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation. arXiv:2602.00557, 2026.
[5] Physical Intelligence Team. pi_0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities. arXiv:2604.15483, 2026.
[6] Generalist AI. GEN-1: Scaling Embodied Foundation Models to Mastery. Generalist AI Blog, 2026. https://generalistai.com/blog/apr-02-2026-GEN-1
[7] Hua, P., Liu, M., Macaluso, A., Lin, Y., Zhang, W., Xu, H., & Wang, L. GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs. arXiv:2410.03645, 2024.
[8] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., & others. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation. arXiv:2506.18088, 2025.
[9] Tian, Y., Yang, Y., Xie, Y., Cai, Z., Shi, X., Gao, N., Liu, H., Jiang, X., Qiu, Z., Yuan, F., & others. InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy. arXiv:2511.16651, 2025.
[10] Deshpande, A., et al. MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation. arXiv:2603.16861, 2026.
[11] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., & others. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818, 2023.
[12] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., & others. pi_0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164, 2024.
[13] Fang, Y., Feng, Y., Jing, D., Liu, J., Yang, Y., Wei, Z., Szafir, D., & Ding, M. When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs. arXiv:2602.17659, 2026.
[14] Katara, P., Xian, Z., & Fragkiadaki, K. Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models. arXiv:2310.18308, 2023.
[15] Generalist AI. GEN-0 / Embodied Foundation Models That Scale with Physical Interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0