We are looking for an experienced Senior AI Platform Engineer to join our team and lead the development of our core AI execution infrastructure. This role is central to our strategy, focusing on designing, building, and scaling a robust, high-performance platform capable of deploying and managing thousands of concurrent AI agents. You will be responsible for ensuring the platform provides reliability, observability, and cost-efficiency at scale.
What You Will Do
Platform Architecture: Design and implement the core components of the AI platform, including runtime environments, distributed scheduling, and resource management systems (e.g., CPU/GPU compute clusters, autoscaling).
Scalability & Performance: Develop and optimize distributed systems and microservices to handle massive scale and low-latency requirements for agent execution.
Infrastructure Automation: Work closely with DevOps/SRE teams to automate deployment, scaling, and monitoring using technologies like Kubernetes, Terraform, and CI/CD pipelines.
Agent Lifecycle: Implement APIs and services that manage the full lifecycle of an AI agent, from ingestion and registration to execution, monitoring, and versioning.
Observability: Implement comprehensive logging, tracing, and metrics for all platform components to provide deep insights into agent behavior and system health.
Best Practices: Drive best practices for code quality, security, and operational excellence within the platform engineering team.
Key Requirements
10+ years of professional software engineering experience, with a significant focus on cloud infrastructure and platform development.
Mandatory: Proven expertise in developing, deploying, and scaling distributed systems (e.g., Kafka or HPC distributed compute frameworks) in a production environment.
Expert-level proficiency in at least one modern programming language (e.g., .NET, Python, Rust).
Deep practical experience with container orchestration (e.g., Kubernetes, Knative), cloud providers (e.g., Azure, AWS), and infrastructure as code (e.g., Terraform).
Solid understanding of networking, security, and performance optimization for data-intensive applications.
Experience building high-throughput APIs (REST/gRPC) and developing platform service interfaces.
Bonus Points
Prior experience in MLOps, LLMOps, or building systems specifically for running machine learning models or AI agents.
Familiarity with agentic frameworks, large language models (LLMs), agent protocols (MCP, A2A) and their unique deployment challenges.
Experience with high-performance computing (HPC) environments or GPU virtualization.
Background at a leading AI software company or a software company delivering LLM frameworks.
AI Platform Lead • Bengaluru, Karnataka, India