Job Title: VLM Research Engineer
Location: Vapi, Gujarat
Employment Type: Full-Time
Overview
We are seeking a highly skilled VLM Research Engineer to build multimodal (vision-language-action) models for instruction following, scene grounding, and tool use across robotic platforms. The role involves developing advanced models that bridge perception and language understanding for autonomous systems.
Key Responsibilities
Pretrain and fine-tune VLMs, aligning them with robotics data including video, teleoperation, and language.
Build perception-to-language grounding for referring expressions, affordances, and task graphs.
Develop Toolformer-style tool-use and actuator interfaces that convert language intents into executable skills and motion plans.
Create evaluation pipelines for instruction following, safety filtering, and hallucination control.
Collaborate with cross-functional teams to integrate models into robotics platforms.
Must-Haves
Master’s or PhD in a relevant field.
1–2+ years of experience in Computer Vision / Machine Learning.
Strong proficiency in PyTorch or JAX; experience with LLMs and VLMs.
Familiarity with multimodal datasets, distributed training, and reinforcement/imitation learning (RL/IL).
Nice-to-Haves
Experience with world models, diffusion-policy integration, and speech interfaces.
Familiarity with sim-to-real transfer.