Job Details :
Job Title : Site Reliability Engineer (SRE) With Azure & AI
Duration : Contract Position (On the Payroll of Datum Technology Group)
Location : Chennai || Mumbai || Gurugram
Interview Process : Virtual (2 Rounds) + 1 Technical screening.
Job Description :
- We are seeking a skilled and collaborative Site Reliability Engineer (SRE) with deep expertise in Azure cloud hosting, AI infrastructure, and automation.
- The ideal candidate will have hands-on experience managing cloud environments through the GitHub / Azure DevOps lifecycle, and a strong understanding of AI model deployment and scaling.
- You will work closely with a team of engineers to ensure reliable, secure, and scalable infrastructure for AI workloads and enterprise applications.
Key Responsibilities
- Design, build, and maintain scalable cloud infrastructure on Microsoft Azure.
- Automate infrastructure provisioning and deployment using Terraform, Argo, and Helm.
- Manage and optimize Azure Kubernetes Service (AKS) clusters for AI and microservices workloads.
- Support hosting of AI models using frameworks like Hugging Face Transformers, vLLM, or llama.cpp on Azure OpenAI, VMs, or GPUs.
- Implement CI / CD pipelines using GitHub Actions and integrate with JFrog Artifactory.
- Monitor system performance and reliability using Grafana and proactively address issues.
- Collaborate with software engineers to ensure infrastructure supports application needs.
- Ensure compliance with networking and information security best practices.
- Manage caching and data layer performance using Redis.
Required Skills & Technologies
Core to Role :
- Azure Cloud Services (including Azure OpenAI)
- AI Model Hosting & Infrastructure Knowledge
- GitHub (CI / CD, workflows)
- Azure Kubernetes Service (AKS)
- Argo, Helm
- Terraform
- Docker
- JFrog
- Grafana
- Networking & Security
- Redis
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 6+ years of experience in SRE, DevOps, or Cloud Infrastructure roles.
- Proven experience with AI infrastructure and model deployment.
- Strong communication and teamwork skills.