Experience: Three to five years of shipping production AI or machine-learning systems and scaling data-intensive back ends.
Why this role matters
Terrabase is shaping the next frontier of work AI: an adaptive platform where ambient and specialized agents mesh seamlessly to deliver the one answer that matters, instantly and safely. Think category-defining speed, unwavering accuracy, and enterprise-grade guardrails. Your mission: harden that edge with bulletproof eval loops, unbreakable safety nets, and ruthless performance tuning across our multi-agent engine.
To streamline and fast-track screening, please submit your details here: https://docs.google.com/forms/d/e/1FAIpQLSdXchIjToJznrB9w9XunNc3frpipGRluVU2Aq20WBT-ll4A5Q/viewform?usp=header
We’ll be reviewing your responses as part of the initial screening process. Please make sure to complete and submit all details through the form to be considered for the next stage.
What will you do
- Own the evaluation loop: Design offline and real-time test harnesses, golden-set datasets, and automated regression dashboards that grade each new agent release on precision, recall, latency, and cost.
- Harden safety and guardrails: Implement content filters, prompt firewalls, and fallback chains so answers stay compliant with SOC 2 and HIPAA constraints.
- Optimize prompts and retrieval: Iterate on system, user, and tool prompts for diverse enterprise workflows. Tune ranking models and vector search parameters to lift relevance.
- Benchmark LLM approaches: Compare open-weight models, hosted APIs, and fine-tuned derivatives. Present trade-off reports that balance performance with budget.
- Prototype and demo: Build thin, focused proof-of-concepts that show customers new capabilities before we commit to full sprint cycles.
- Document and share best practices: Write concise run-books, design notes, and post-mortems so the next engineer can reproduce your results without guesswork.
- Stay current: Track the latest research on retrieval-augmented generation, tool-calling agents, and evaluation methodologies; bring the most practical ideas into production.
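To give a concrete flavor of the evaluation-loop work described above, here is a minimal sketch of an offline golden-set harness. All names (`GOLDEN_SET`, `fake_agent`, `evaluate`) and the latency budget are illustrative assumptions, not Terrabase's actual tooling.

```python
import time

# Hypothetical golden-set entries: (query, expected answer).
GOLDEN_SET = [
    ("reset password", "settings"),
    ("export report", "reports"),
]

def fake_agent(query: str) -> str:
    # Stand-in for a real agent call; a production harness would
    # invoke the deployed multi-agent engine here.
    return {"reset password": "settings", "export report": "reports"}[query]

def evaluate(agent, golden_set, latency_budget_s: float = 1.0) -> dict:
    """Grade an agent release on accuracy and worst-case latency."""
    correct = 0
    worst_latency = 0.0
    for query, expected in golden_set:
        start = time.perf_counter()
        answer = agent(query)
        worst_latency = max(worst_latency, time.perf_counter() - start)
        correct += answer == expected
    return {
        "accuracy": correct / len(golden_set),
        "worst_latency_ok": worst_latency <= latency_budget_s,
    }
```

A regression dashboard would run this per release and alert when accuracy drops or the latency budget is breached.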
What we look for
- Two to five years of building or operating machine learning or data-intensive back ends in production.
- Strong work ethic and bias for ownership. You identify problems, propose fixes, and drive them to closure.
- Clear, systematic thinker. Your design docs read like thinking in public, and your code structure reflects first-principles reasoning.
- Proficient Python engineer comfortable with type hints, pytest, and modern packaging.
- Hands-on experience with at least one of LangChain, LangGraph, or other agent frameworks.
- Familiarity with vector databases and semantic search fundamentals.
- Evidence of structured problem solving: a design doc, a refactored subsystem, or an open-source pull request.
- Comfort reading and implementing research ideas from scratch, such as those from "Attention Is All You Need" (Vaswani et al., 2017) or similar foundational papers.
- Clear communication and bias for action: you unblock yourself and raise flags early.
Bonus points
- Prior work with evaluation libraries such as Ragas, LM-Eval, or Intercode.
- Experience integrating compliance guardrails or red-team testing for Gen-AI systems.
- Contributions to open-source AI projects or published technical blogs.
Life at Terrabase
We operate as a sharp, humble, fully remote crew that values deep focus and fast feedback. Your code ships to real customers every week, supported by generous GPU budgets and a culture that prizes clear thinking over long meetings.
Terrabase is an equal opportunity employer. We celebrate diversity and are committed to building an inclusive environment for every team member.