Our Mission
To make video as accessible to machines as text and voice are today.
At lookup, we believe the world's most valuable asset is trapped. Video is everywhere, but it's unsearchable: a black box of insight that no one can open, or at least open affordably. We’re changing that. We're building the search engine for the visual world, so anyone can find or do anything with video just by asking.
Text is queryable. Voice is transcribed. Video, the largest and richest data source of all, is still a black box. A computer can't understand it, and so its value remains trapped.
Our mission at lookup is to fix this.
When video is truly accessible to machines, it doesn't just mean a person can finally search their archives. It means an entirely new generation of automated workflows, creative tools, and intelligent applications can be built. We are not just building an app; we are building the API for software to see the world!
About the Role
We are looking for founding Machine Learning Engineers to help build state-of-the-art video understanding and assistant capabilities. You will work across the stack, including query understanding, video understanding, domain-adapted language models, natural language question answering, evaluation, video agents, and experimentation. You will partner closely with customers, understand their pain points in depth, and use the right tools, simple or complex, to solve real problems.
What You’ll Do
Prototype, fine‑tune, and productionize vision-language models for video content.
Design and implement embedding pipelines to represent visual and audio signals for text-based retrieval and multimodal RAG.
Build models that generate concise, domain-focused natural language summaries, captions, and transcripts for videos.
Adapt and fine‑tune state-of-the-art VLMs to answer natural language questions grounded in video frames, temporal context, and audio.
Develop evaluation frameworks and benchmarks, including metric definitions, offline experiments, A/B tests, and regression tracking.
Apply and extend classical CV models and trackers (e.g., RF‑DETR, DeepSORT, YOLO‑pose) to power an object/event database and to enable video agents orchestrated with LLMs.
Who You Are
4+ years of professional experience in ML or a related field.
Strong ML engineering background across classical computer vision and modern vision‑language modeling (VLMs).
Hands‑on experience with training and inference for CV models (CNNs, ViTs) and/or VLMs (e.g., LLaVA, Qwen2.5‑VL, InternVL).
Proven ability to design, build, and ship production‑ready ML systems: data pipelines, training loops, evaluation, deployment, and low‑latency serving for CV/GenAI.
Familiarity with vector databases and retrieval‑augmented generation for efficient embedding storage and retrieval.
Strong proficiency in Python and deep learning frameworks such as PyTorch or TensorFlow.
Familiarity with cloud‑native development on AWS is a plus.
Startup experience in fast‑paced environments is preferred but not required.
Nice to have (optional):
Experience with streaming inference, temporal reasoning, and long‑video context handling.
Knowledge of audio modeling, ASR, diarization, and multimodal alignment.
Experience with experiment platforms, feature stores, or online evaluation.
Location & Culture
Full-time, in-office role in Bangalore (we’re building fast and hands-on).
Must be comfortable with a fast-paced environment and collaboration across PST time zones for our US customers and investors.
Expect startup speed: daily founder syncs, rapid design-to-prototype cycles, and a culture of deep ownership.
Why You Will Love This Role
Work on the frontier of video understanding and real-world AI: products that can redefine trust and automation.
Build production ML systems end to end: modeling, evaluation, and low‑latency serving.
Work closely with founders and collaborate in person in Bangalore.
Competitive salary with meaningful early equity.
https://lookupteam.notion.site/
Engineer Computer Vision • India