Preview release/May 27, 2026

RoboInF: Scaling Robot Manipulation Data in Simulation for Embodied Instruction Following

Diverse instruction-following data for robot manipulation at scale: RoboInF automatically generates realistic scenes, natural task instructions, and verified trajectories across objects, actions, and environments.

Author: XLANG Lab
Date: May 27, 2026

1M+ successful trajectories: Verified rollouts across diverse tasks and environments.

5K+ tasks: Diverse ability distribution for tabletop manipulation learning.

300+ scenes: Randomized tabletop worlds with realistic clutter and variation.

50+ reward primitives: Compose for unlimited reward and evaluation logic.

Generated trajectory samples

Fully automatic robot data, zero manual annotation

These clips require no human teleoperation or manual annotation. RoboInF generates the scenes, instructions, rewards, motion programs, and successful rollouts end to end.

The tasks go beyond scaling simple pick-and-place: they span two axes, from short-horizon to long-horizon tasks and from rigid-body manipulation to articulation.

Organized Scene*

“Arrange the pink and blue toy cars on the display stand so they are back to back.”

Random Scene*

“Turn the mouse around so it faces the other way.”

Organized Scene*

“Hit the parts stand three times with the hammer.”

Random Scene*

“Turn the shampoo bottle upright.”

Organized Scene*

“Place the 7 Up can into the left bottom drawer of the mini cabinet and close the drawer.”

Random Scene*

“Set the brush parallel to the glue stick.”

Organized Scene*

“Place the bowl into the top drawer of the wooden cabinet and close the drawer.”

Random Scene*

“Swap the positions of the cup and the camera.”

Organized Scene*

“Place the pear on the display stand and the apple on the plate, both standing upright.”

Random Scene*

“Group the drink items together and separate the sponge away.”

Organized Scene*

“Position the blue toy car so it is partly hanging off the front edge of the display stand.”

Random Scene*

“Push the ball over to the left side of the table.”

Organized Scene*

“Put all the tools above the black box into the storage box.”

Random Scene*

“Turn the cup upside down on the table.”

Organized Scene*

“Stack the two pieces of bread together and place the jam in the remaining open spot.”

Random Scene*

“Arrange the bolt, brush, and clay pestle in a straight line from left to right.”

Organized Scene*

“Tilt the olive oil bottle slightly 10 degrees to the right.”

*Organized scenes are agentically reconstructed from real-world reference images into simulation (see Scene Generation). Random scenes sample objects and layouts for broad combinatorial coverage. Videos are shown at 1.5× speed.

Contents

Overview · Why RoboInF · Pipeline Explorer · Early model observations · Current Scope · Citation

Why RoboInF

If you are training a vision-language-action model today, your data options are limited. Real robot teleoperation produces high-quality trajectories but scales slowly and covers narrow task distributions. Internet videos are abundant but lack ground-truth actions, and bridging the embodiment gap remains an open problem [1][2][3][4]. The result is a practical bottleneck: generalist manipulation policies need data that is simultaneously diverse in scenes, natural in language, spatially precise, and physically varied, yet most existing pipelines deliver only one or two of those properties at a time.

Modern VLA models have shown increasingly impressive long-horizon behavior, from household tasks to cooking-style demonstrations [5][6]. Those demonstrations make the data problem more urgent, not less. Generalist manipulation needs training data that covers richer scenes, natural language variation, fine-grained spatial control, and perturbations that do not appear in narrow benchmark distributions.

Recent systems such as GenSim2, RoboTwin, InternData-A1, and MolmoBot show that scalable robot data is becoming a central path toward general-purpose manipulation [7][8][9][10]. RoboInF addresses the data bottleneck by coupling five generation stages into a single pipeline: scene construction, task proposal, reward synthesis, motion-code generation with simulator feedback, and domain-randomized rollout with automatic success filtering. Every retained trajectory has been verified against a generated reward function before it enters the training set.
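The last stage, rollout with automatic success filtering, can be pictured as a generate-then-verify loop. The sketch below is a minimal illustration under stated assumptions, not RoboInF's implementation: `Rollout`, `run_rollout`, `evaluate`, and `collect_verified` are hypothetical names, and the "simulator" is a seeded random draw standing in for a real physics rollout and a generated reward function.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """Hypothetical record of one simulated attempt at a task."""
    seed: int
    instruction: str
    final_state: dict  # stand-in for whatever the simulator reports at the end


def run_rollout(instruction: str, seed: int) -> Rollout:
    """Toy stand-in for a domain-randomized simulator rollout.

    A real rollout would execute motion-planning code in a physics simulator;
    here we only draw a final cup tilt angle from a seeded RNG.
    """
    rng = random.Random(seed)
    tilt_deg = rng.uniform(0.0, 90.0)
    return Rollout(seed=seed, instruction=instruction,
                   final_state={"cup_tilt_deg": tilt_deg})


def evaluate(rollout: Rollout) -> bool:
    """Toy stand-in for a generated reward function: cup is roughly upright."""
    return rollout.final_state["cup_tilt_deg"] < 15.0


def collect_verified(instruction: str, seeds: List[int],
                     reward: Callable[[Rollout], bool]) -> List[Rollout]:
    """Keep only the rollouts that the reward function marks as successful."""
    rollouts = (run_rollout(instruction, s) for s in seeds)
    return [r for r in rollouts if reward(r)]


kept = collect_verified("Turn the cup upright.", list(range(100)), evaluate)
print(f"kept {len(kept)} of 100 rollouts")
```

Because each rollout is seeded, a failed or retained trajectory can be regenerated for inspection, which is one plausible way to keep a filtered dataset auditable.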

Pipeline overview

Five generation stages, one training-data output

Each stage produces an artifact that makes the next stage more reliable. The endpoint is not a finished model in this preview; it is filtered VLA supervision containing instructions, observations, actions, and task-success metadata.

01 Scene: randomized tabletop world
02 Task: scene-conditioned instruction
03 Reward: executable evaluate()
04 Program: motion-planning code
05 Rollout: successful trajectory
Output: VLA training record
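To make the idea of an executable evaluate() built from composable reward primitives concrete, here is a small sketch. Everything in it is an assumption for illustration: the flat `State` dictionary, the primitives `inside` and `is_closed`, and the combinators `all_of` and `any_of` are hypothetical and are not taken from the RoboInF codebase.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]              # hypothetical flat simulator state
Predicate = Callable[[State], bool]


def inside(obj: str, container: str) -> Predicate:
    """Primitive: `obj` is reported as contained in `container`."""
    return lambda s: s.get(f"{obj}.container") == container


def is_closed(joint: str, tol: float = 0.01) -> Predicate:
    """Primitive: an articulated joint is within `tol` of its closed position."""
    return lambda s: abs(float(s.get(f"{joint}.opening", 1.0))) <= tol


def all_of(*preds: Predicate) -> Predicate:
    """Combinator: every sub-condition must hold."""
    return lambda s: all(p(s) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Combinator: at least one sub-condition must hold."""
    return lambda s: any(p(s) for p in preds)


# "Place the bowl into the top drawer of the wooden cabinet and close the drawer."
evaluate = all_of(inside("bowl", "top_drawer"), is_closed("top_drawer"))

success = {"bowl.container": "top_drawer", "top_drawer.opening": 0.0}
drawer_left_open = {"bowl.container": "top_drawer", "top_drawer.opening": 0.12}

print(evaluate(success), evaluate(drawer_left_open))  # True False
```

Building task checks from a small set of predicates means a new instruction needs a new composition rather than new low-level checking code.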

Pipeline Explorer

The five-stage generation loop

Diverse worlds

Scene Generation - Building Diverse Robot Manipulation Worlds

What this stage solves. Robust policies need more than clean tabletop scenes. They need clutter, realistic object co-occurrence, spatial variation, camera changes, lighting changes, and physical diversity.

How it works. RoboInF uses two complementary scene-generation modes. Random synthesis provides broad combinatorial coverage by sampling everyday objects, converting them into simulation-ready assets, rescaling them to plausible physical sizes, and placing them in physics-valid tabletop layouts. Image-conditioned agentic generation helps reconstruct more natural arrangements from reference images, including object and spatial manifests for kitchen-style or household-style scenes.
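One simple way to picture the placement step in random synthesis is rejection sampling over object footprints. The sketch below is only an approximation under stated assumptions: objects are reduced to circular footprints, "physics-valid" is reduced to "inside the table and not overlapping", and the function name `sample_layout`, the table dimensions, and the radii are all hypothetical. A real pipeline would additionally settle objects under simulated physics.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Placement = Tuple[float, float, float]  # (x, y, footprint radius), in metres


def sample_layout(radii: Sequence[float], table_w: float = 1.0,
                  table_d: float = 0.6, max_tries: int = 200,
                  seed: int = 0) -> Optional[List[Placement]]:
    """Place circular footprints on a table by rejection sampling.

    Returns None if some object cannot be placed within `max_tries` draws.
    """
    rng = random.Random(seed)
    placed: List[Placement] = []
    for r in radii:
        if 2 * r > min(table_w, table_d):
            return None  # object cannot fit on the table at all
        for _ in range(max_tries):
            # Sample a centre that keeps the whole footprint on the table.
            x = rng.uniform(r, table_w - r)
            y = rng.uniform(r, table_d - r)
            clear = all(math.hypot(x - px, y - py) >= r + pr
                        for px, py, pr in placed)
            if clear:
                placed.append((x, y, r))
                break
        else:
            return None  # gave up: table too crowded for this sample
    return placed


layout = sample_layout([0.05, 0.04, 0.08, 0.03])
print(layout)
```

Changing `seed` yields a different layout from the same object set, which is the combinatorial coverage that random scenes are meant to provide.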

Across both modes, RoboInF randomizes object poses, camera views, robot initial states, textures, backgrounds, lighting, and physics parameters. The intended distribution is not one perfect simulated world, but many plausible worlds that expose policies to natural visual and physical variation.
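Per-episode randomization of this kind is often implemented as a seeded draw from fixed parameter ranges. The following sketch assumes exactly that; the field names, the `RANGES` values, and `sample_randomization` are hypothetical choices for illustration and do not reflect the parameters RoboInF actually randomizes or their ranges.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeRandomization:
    """Hypothetical bundle of per-episode visual and physical parameters."""
    camera_yaw_deg: float
    camera_height_m: float
    light_intensity: float
    friction: float
    object_mass_scale: float


# Hypothetical (low, high) ranges, chosen only to make the sketch concrete.
RANGES = {
    "camera_yaw_deg": (-20.0, 20.0),
    "camera_height_m": (0.4, 0.9),
    "light_intensity": (0.5, 1.5),
    "friction": (0.3, 1.0),
    "object_mass_scale": (0.8, 1.2),
}


def sample_randomization(seed: int) -> EpisodeRandomization:
    """Draw one set of parameters uniformly from RANGES, reproducibly."""
    rng = random.Random(seed)
    values = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    return EpisodeRandomization(**values)


for seed in range(3):
    print(sample_randomization(seed))
```

Storing the seed alongside each trajectory would let the exact camera, lighting, and physics settings be reconstructed later without saving them explicitly.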

Figure: Generated tabletop scenes from the scene generation stage.

Early model observations

We have begun training VLA models on a subset of the generated data. We are not reporting quantitative results in this preview because we want the first published numbers to come with a reproducible benchmark and ablation study rather than preliminary snapshots.

Qualitatively, models trained with RoboInF data handle perturbations (distractor objects, changed lighting, shifted camera poses) more reliably than our internal baselines, and they follow compositional instructions more consistently. We have also seen early signs of zero-shot sim-to-real transfer, which we are working to characterize rigorously.

For this preview, the main contribution is the data engine itself: a way to automatically generate diverse, realistic, controllable, and verifiable manipulation experience at scale. RoboInF is our first step toward scalable robot data generation that is both broad and inspectable, and we are continuing to expand the pipeline across embodiments, richer physical settings, and stronger mixtures of synthetic and real-world data.

Full results, training recipes, and benchmark details will accompany the data and code release.

Current Scope and Next Steps

This preview focuses on scalable, controllable data generation today. The same pipeline points to the next physical settings, embodiments, and data mixtures we are expanding toward.

Today | Next | Why it matters
Single-arm manipulation | Dual-arm and broader embodiments | More realistic household and industrial manipulation.
Rigid objects | Soft objects, liquids, and deformables | Clothes, food, packaging, and other everyday materials.
Reward-filtered SFT data | RL fine-tuning and multi-task co-training with the same rewards | Move from imitation to optimization without writing new reward functions.
Coarse-grained manipulation tasks | More intricate, fine-grained, and difficult task programs | Stress-test compositional rewards, contact-rich motion code, and long-horizon verification.

Citation

If you find this blog post and its content helpful, please cite:

@article{roboinf,
  title = {RoboInF: Scaling Robot Manipulation Data in Simulation for Embodied Instruction Following},
  author = {XLANG Lab},
  journal = {xlang.ai},
  year = {2026},
  month = {May},
  url = {https://xlang.ai/blog/roboinf}
}

References

[1] Chi, C., Xu, Z., Pan, C., Cousineau, E., Burchfiel, B., Feng, S., Tedrake, R., & Song, S. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv:2402.10329, 2024.

[2] Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Lin, Y., & others. Latent Action Pretraining from Videos. arXiv:2410.11758, 2024.

[3] Luo, H., Feng, Y., Zhang, W., Zheng, S., Wang, Y., Yuan, H., Liu, J., Xu, C., Jin, Q., & Lu, Z. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos. arXiv:2507.15597, 2025.

[4] Dai, W., Lan, K., Zhou, J., Zhao, B., Su, X., Tong, J., Guan, W., & Yang, S. ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation. arXiv:2602.00557, 2026.

[5] Physical Intelligence Team. pi_0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities. arXiv:2604.15483, 2026.

[6] Generalist AI. GEN-1: Scaling Embodied Foundation Models to Mastery. Generalist AI Blog, 2026. https://generalistai.com/blog/apr-02-2026-GEN-1

[7] Hua, P., Liu, M., Macaluso, A., Lin, Y., Zhang, W., Xu, H., & Wang, L. GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs. arXiv:2410.03645, 2024.

[8] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., & others. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation. arXiv:2506.18088, 2025.

[9] Tian, Y., Yang, Y., Xie, Y., Cai, Z., Shi, X., Gao, N., Liu, H., Jiang, X., Qiu, Z., Yuan, F., & others. InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy. arXiv:2511.16651, 2025.

[10] Deshpande, A., et al. MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation. arXiv:2603.16861, 2026.

[11] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., & others. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818, 2023.

[12] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., & others. pi_0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164, 2024.

[13] Fang, Y., Feng, Y., Jing, D., Liu, J., Yang, Y., Wei, Z., Szafir, D., & Ding, M. When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs. arXiv:2602.17659, 2026.

[14] Katara, P., Xian, Z., & Fragkiadaki, K. Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models. arXiv:2310.18308, 2023.

[15] Generalist AI. GEN-0 / Embodied Foundation Models That Scale with Physical Interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0