Responsibilities:
- Own day-2 production operations of a large-scale, AI-first platform on GCP.
- Run, scale, and harden GKE-based workloads integrated with a broad set of GCP managed services (data, messaging, AI, networking, and security).
- Define, implement, and operate SLIs, SLOs, and error budgets across platform and AI services.
- Build and own New Relic observability end-to-end (APM, infrastructure, logs, alerts, dashboards).
- Improve and maintain CI/CD pipelines and Terraform-driven infrastructure automation.
- Operate and integrate Azure AI Foundry for LLM deployments and model lifecycle management.
- Lead incident response and postmortems, and drive systemic reliability improvements.
- Optimize cost, performance, and autoscaling for AI and data-intensive workloads.
Qualifications:
- 6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles.
- Strong, production-grade experience with GCP, especially GKE and core managed services.
- Proven expertise running Kubernetes at scale in live environments.
- Deep hands-on experience with New Relic in complex, distributed systems.
- Experience operating AI/ML or LLM-driven platforms in production environments.
- Solid background in Terraform, CI/CD, cloud networking, and security fundamentals.
- Comfortable owning production systems end-to-end with minimal supervision.
Location requirement: the first 2 months in Mumbai, then Bangalore. Accommodation in Mumbai will be provided for those 2 months.
VAYUZ Technologies - DevOps Engineer - Site Reliability • Bangalore