Job Title: Site Reliability Engineer (SRE) – Azure & AI
Experience: 7+ years
Work Mode: Hybrid
Work Location: Chennai / Mumbai / Gurgaon
Job Summary:
We are looking for an experienced Site Reliability Engineer (SRE) with strong expertise in Microsoft Azure, AI infrastructure, and automation. The ideal candidate will have a solid background in managing cloud environments using GitHub / Azure DevOps, and hands-on experience in AI model deployment and scaling. This role involves working closely with engineering teams to deliver reliable, secure, and scalable cloud infrastructure that supports AI workloads and enterprise applications.
Key Responsibilities:
Design, build, and maintain scalable cloud infrastructure on Microsoft Azure.
Automate infrastructure provisioning and deployment using Terraform, Argo, and Helm.
Manage and optimize Azure Kubernetes Service (AKS) for AI and microservices workloads.
Support AI model hosting using frameworks such as Hugging Face Transformers, vLLM, or llama.cpp on Azure OpenAI, VMs, or GPUs.
Implement CI/CD pipelines using GitHub Actions and integrate with JFrog Artifactory.
Monitor and maintain system performance and reliability using Grafana, ensuring proactive issue resolution.
Collaborate with development teams to align infrastructure with application requirements.
Enforce networking and information security best practices.
Manage and optimize caching and data layer performance using Redis.
Required Skills & Technologies:
Azure Cloud Services (including Azure OpenAI)
AI Model Hosting & Infrastructure
GitHub (CI/CD, workflows)
Azure Kubernetes Service (AKS)
Argo, Helm, Terraform
Docker, JFrog, Grafana
Networking & Security, Redis