Job Description
Senior DevOps Engineer - AI/ML Infrastructure
Position Overview: We are seeking an experienced Senior DevOps Engineer to build and maintain the production infrastructure for our enterprise AI automation platform. This role combines traditional DevOps expertise with specialized knowledge of AI/ML workloads, focusing on the reliability, scalability, and cost optimization of agentic AI systems. The successful candidate will work as part of our Agentic AI development team to ensure robust, production-ready deployments of complex AI workflows.
Key Responsibilities:
- Design and implement CI/CD pipelines for AI applications, including model deployment and agent workflows
- Build and maintain Kubernetes clusters optimized for AI workloads including GPU resource management
- Implement comprehensive monitoring and observability for AI systems, including custom metrics for model performance (see the sketch after this list)
- Develop infrastructure-as-code solutions for scalable AI service deployments
- Establish reliability engineering practices including SLA management and incident response for AI systems
- Optimize cloud infrastructure costs with a focus on GPU utilization and LLM API usage
- Implement security and compliance frameworks for AI applications and data pipelines
- Collaborate with development teams to ensure production readiness of AI agents and RAG systems
- Manage multi-cloud deployments and vendor integrations for AI services
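To give a flavor of the model-performance monitoring responsibility above, here is a minimal, illustrative Python sketch using the prometheus_client library. The metric names, labels, and scrape port are hypothetical examples, not a description of our actual stack.

```python
# Illustrative sketch only: exposing custom AI metrics for Prometheus to scrape.
# Metric names, labels, and port 8000 are assumptions made for this example.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

LLM_REQUESTS = Counter(
    "llm_requests_total", "LLM API calls issued by agent workflows",
    ["model", "status"],
)
LLM_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end LLM call latency",
    ["model"], buckets=(0.5, 1, 2, 5, 10, 30),
)
GPU_UTILIZATION = Gauge(
    "gpu_utilization_ratio", "Fraction of allocated GPU capacity in use", ["node"],
)

def observed_llm_call(model: str, call):
    """Wrap an LLM call so latency and success/failure are recorded."""
    start = time.monotonic()
    try:
        result = call()
        LLM_REQUESTS.labels(model=model, status="ok").inc()
        return result
    except Exception:
        LLM_REQUESTS.labels(model=model, status="error").inc()
        raise
    finally:
        LLM_LATENCY.labels(model=model).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod>:8000/metrics
    GPU_UTILIZATION.labels(node="gpu-node-1").set(0.42)  # placeholder value
    observed_llm_call("example-model", lambda: "ok")
```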
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering, or related technical field
- 7-10 years of DevOps/Infrastructure experience with demonstrated production system ownership
- Strong expertise in Kubernetes orchestration and container management (Docker)
- Proficiency in Python scripting and automation
- Extensive experience with Linux system administration and performance tuning
- Hands-on experience with Jenkins or similar CI/CD platforms
- Production experience with cloud platforms (AWS, GCP, or Azure)
- Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, or similar)
AI/ML Infrastructure Requirements:
- Experience deploying and managing AI/ML workloads in production environments
- Understanding of RAG system infrastructure requirements and vector database operations
- Knowledge of LLM API integration patterns and rate-limiting strategies (sketched below)
- Experience with GPU cluster management and resource optimization
- Familiarity with AI agent workflows and their operational characteristics
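As one way to read the rate-limiting item above, the following is a rough, standard-library-only Python sketch of a client-side token bucket with retry and jittered backoff. The rates, backoff figures, and the call_llm helper are illustrative assumptions, not a prescribed implementation.

```python
# Illustrative sketch only: a client-side token-bucket limiter with retry/backoff
# for LLM API calls. Rates, backoff values, and call_llm() are hypothetical.
import random
import time

class TokenBucket:
    """Allow roughly `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait until the next token accrues

def call_with_retries(bucket: TokenBucket, request, max_attempts: int = 5):
    """Respect the local budget, then retry transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        bucket.acquire()
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(min(30, 2 ** attempt) + random.random())  # backoff + jitter

# Usage: share one bucket per provider/API key so agents stay under the vendor quota.
bucket = TokenBucket(rate=2.0, capacity=10)  # ~2 requests/sec, bursts of 10
# call_with_retries(bucket, lambda: call_llm("prompt"))  # call_llm is a hypothetical client
```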
Site Reliability Engineering Skills:
- Production monitoring and alerting experience with tools like Prometheus, Grafana, or Datadog
- Incident response and post-mortem experience with complex distributed systems
- Capacity planning and performance optimization for high-traffic applications
- Experience with log aggregation and distributed tracing systems
- Understanding of reliability patterns, including circuit breakers and graceful degradation (sketched below)
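For the circuit-breaker item above, here is a minimal, illustrative Python sketch of the pattern with a graceful-degradation fallback. The thresholds, timeouts, and the commented-out downstream helpers are assumptions for illustration only.

```python
# Illustrative sketch only: a minimal circuit breaker with graceful degradation.
# Thresholds, timeouts, and fallback behaviour are assumptions for this example.
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a trial call after `reset_after` seconds."""
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, skip the dependency entirely and degrade gracefully via the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: wrap calls to a flaky downstream such as a vector store or LLM endpoint.
breaker = CircuitBreaker(max_failures=3, reset_after=10.0)
# answer = breaker.call(lambda: query_vector_db(q), fallback=lambda: cached_answer(q))
```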
Preferred Qualifications:
- Experience with MLOps practices and model deployment pipelines
- Knowledge of AI-specific monitoring, including model drift detection and performance metrics
- Experience with cost optimization strategies for AI workloads
- Background in financial services, gaming, or other high-availability environments
- Certification in major cloud platforms (AWS Solutions Architect, GCP Professional, etc.)
- Experience with service mesh technologies (Istio, Linkerd)
Technical Environment:
- Multi-cloud infrastructure with a primary focus on AWS/GCP
- Kubernetes-based container orchestration
- Modern observability stack with custom AI metrics
- GitOps workflows and infrastructure automation
- Integration with enterprise security and compliance frameworks