Job Title : Site Reliability Engineer (SRE) – Azure & AI
Experience : 7+ years
Work Mode : Hybrid
Work Location : Chennai / Mumbai / Gurgaon
Job Summary :
We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud environments using GitHub/Azure DevOps, and hands-on experience in AI model deployment and scaling. This role involves working closely with engineering teams to deliver reliable, secure, and scalable cloud infrastructure that supports AI workloads and enterprise applications.
Key Responsibilities :
- Design, build, and maintain scalable cloud infrastructure on Microsoft Azure.
- Automate infrastructure provisioning and deployment using Terraform, Argo, and Helm.
- Manage and optimize Azure Kubernetes Service (AKS) for AI and microservices workloads.
- Support AI model hosting using frameworks such as Hugging Face Transformers, vLLM, or llama.cpp on Azure OpenAI, VMs, or GPUs.
- Implement CI/CD pipelines using GitHub Actions and integrate with JFrog Artifactory.
- Monitor and maintain system performance and reliability using Grafana, ensuring proactive issue resolution.
- Collaborate with development teams to align infrastructure with application requirements.
- Enforce networking and information security best practices.
- Manage and optimize caching and data layer performance using Redis.
Required Skills & Technologies :
- Azure Cloud Services (including Azure OpenAI)
- AI Model Hosting & Infrastructure
- GitHub (CI/CD, workflows)
- Azure Kubernetes Service (AKS)
- Argo, Helm, Terraform
- Docker, JFrog, Grafana
- Networking & Security, Redis