Members of Technical Staff, Physical AI (Robotics / World Models)
Orbifold
Software Engineering, IT, Data Science
Palo Alto, CA, USA
About Orbifold AI
Orbifold AI advances the frontier of physical AI and world models through rigorous evaluation and curated, real-world data. We work directly with leading robotics and world model research teams on the field's hardest problem: systematically surfacing where today's foundation models break, and producing the curated multimodal data that closes the gap.
Our work sits at the intersection of data, evaluation, and model training. We design evaluation harnesses that expose a model's true failure modes; each failure becomes a structured data deficit; and our curation pipeline produces the targeted, high-quality, fully verified data that fills it — collected against a specific failure mode, sampled to balance the long tail, annotated to a co-defined taxonomy, and verified before it reaches a training or reinforcement learning run. Each cycle compounds: sharper evaluations expose finer failures, finer failures drive more precise curation, and more precise curation narrows the distance between demo and deployment.
We collaborate with partners end to end, co-designing datasets, evaluation frameworks, and training and RL pipelines that shape how their models learn from real-world signals. The data standard and curation framework we're building will define the next frontier of robotics and world model training.
About the Role
We are hiring Members of Technical Staff to build the data and evaluation foundations for world models and embodied AI systems. Today's frontier models look impressive on cherry-picked demos but break in production: they fail on long-tail edge cases, hallucinate, lose temporal coherence, mishandle contact and causality, generalize poorly out of distribution, and produce silent failures that automated metrics don't catch. Closing that gap requires evaluation infrastructure that can systematically surface, categorize, and diagnose failures—and feed those signals directly back into data and training.
In this role, you will work closely with internal teams and external research partners to design, build, and iterate on data pipelines, workflows, and evaluation frameworks that drive model quality. You'll define what "good" means for a given partner, build the harnesses that measure it, and translate failures into the next round of data curation and training. This is a highly applied role focused on real-world system performance, not purely theoretical research.
Key Responsibilities
- Co-design data pipelines and model training workflows end to end with robotics and world model teams
- Define how multimodal data (video, image, audio, sensor, text) should be structured and indexed for training and evaluation
- Build scalable systems for data ingestion, cleaning, annotation, and taxonomy-driven curation and balanced sampling across the failure modes that matter for partner deployments
- Work on training or fine-tuning models for perception, policy learning, or multimodal reasoning, including automated critics and judges that approximate human evaluation at scale
- Iterate on datasets and training setups based on downstream model performance, closing the loop between evaluation findings and the next round of data curation
- Build comprehensive evaluation frameworks covering fine-grained failure taxonomies, edge-case discovery, long-tail probing, distribution-shift and robustness testing, and behavior under adversarial or out-of-spec inputs
- Bridge gaps between raw data, dataset design, and model behavior, and translate evaluation findings into concrete data and training recommendations for partner teams
Preferred Qualifications
- Self-driven and high-agency, with experience working in fast-paced applied research or startup environments
- Experience in robotics, embodied AI, world models, spatiotemporal reasoning, multimodal reasoning, and/or reinforcement learning
- Strong understanding of model training (e.g., VLA systems, world models, multimodal reasoning, video generation models) and RL workflows at scale
- Experience working with real-world data pipelines, including collection, preprocessing, and curation
- Experience designing evaluation frameworks, including failure mode analysis, benchmark construction, rubric and metric design, and automated critics that correlate with downstream model behavior
- Ability to reason about how data quality, mixture, and structure impact model performance
- Hands-on experience building or iterating on applied ML systems at scale
- Comfortable operating across both research and engineering, and building production-grade systems
Nice to Have
- Experience with large-scale video or multimodal datasets
- Experience in simulation to real transfer or real-world robotics data collection
- Familiarity with dataset annotation strategies or evaluation frameworks
- Experience training or evaluating Multimodal Large Language Models (MLLMs) as critics, judges, or reward models
Why join Orbifold AI?
- Work directly with leading teams building real-world AI systems, the same partners shipping the next generation of world models and embodied agents
- Stay on the cutting edge of RL, physical AI, and world model research
- Tackle one of the most important and underexplored problems in the field: turning evaluation from a vanity metric into a real driver of model improvement
- Operate across both data and model training, not just one side — owning the loop from raw data to evaluation to the next training run
- High ownership, fast iteration, and real impact on deployed systems
Apply Now: Send your resume, a short introduction, and any relevant work (papers, projects, repos) to careers@orbifold.ai. We'd love to hear from you!