Description :
Site Reliability Engineer (SRE) - Azure / AKS Lead
Role Overview :
This is a senior technical leadership role for a Site Reliability Engineer (SRE) requiring 10+ years of experience, focused on owning and driving reliability for mission-critical, high-scale services deployed on Microsoft Azure.
The role requires prior experience as a DevOps Engineer followed by a transition into a dedicated SRE function. The incumbent must have expert knowledge of Azure, AKS (Azure Kubernetes Service), and modern reliability practices, including defining and enforcing SLIs / SLOs.
Based in Trivandrum, this SRE will shape technical standards, lead major incident response, and champion engineering excellence across multiple development teams.
Job Summary :
We are seeking an experienced SRE Lead (10+ years) with a strong background in Azure and AKS to ensure the highest levels of availability, performance, and scalability for our Tier-0 / Tier-1 services.
This role is responsible for establishing and maintaining core SRE practices, including defining error budgets, implementing multi-burn-rate alerting, driving continuous automation (Terraform / GitOps), and leading critical incident response with calm clarity. Expertise in observability, disaster recovery design (RTO / RPO), and cluster hardening is mandatory.
Key Responsibilities and Reliability Engineering Deliverables :
- Service Level Management : Define SLIs / SLOs for Tier-0 / Tier-1 services and conduct quarterly reviews. Implement multi-window, multi-burn-rate alerts to precisely detect evolving service degradation.
- Error Budgeting and Change Gating : Enforce reliability constraints by gating changes in CI / CD on error budgets (using tools like Azure DevOps / GitHub Actions). Conduct weekly SLO reviews and drive the reliability roadmap.
- Incident Management Command : Lead SEV1 / SEV2 incidents as the Incident Commander, taking ownership of rapid resolution, clear communication & postmortems. Ensure all corrective actions are implemented effectively.
- Reliability Architecture & Kubernetes : Design and implement robust reliability patterns including DR (Disaster Recovery), multi-AZ / region configurations, HPA / VPA / KEDA for optimized scaling, and resilient deployment strategies like canary, blue-green, and rollback.
- Cluster Hardening & Optimization : Drive cluster hardening initiatives (network, identity, policy). Optimize resource utilization and service density. Manage ingress traffic using AGIC / NGINX.
- Observability Implementation : Implement comprehensive observability solutions using metrics, traces, and logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, and OpenTelemetry. Ensure alerts fire on symptoms, not noise.
- Automation and Infrastructure as Code (IaC) : Automate platform provisioning using Terraform / Bicep. Apply GitOps principles (Flux / Argo) to deployment management and enforce compliance using Azure Policy / OPA Gatekeeper. Automate toil and build self-service runbooks / ChatOps.
- Performance & Capacity Planning : Conduct rigorous load testing. Tune platform autoscaling strategies and collaborate with FinOps to optimize cloud cost.
- Disaster Recovery and Testing : Define RTO / RPO objectives. Ensure compliance by executing regular chaos drills & game days to validate resilience.
- Security and Governance : Implement security best practices using Entra ID (Azure AD), Key Vault rotation, and VNets / NSGs, and drive shift-left security practices within the CI pipeline.
Mandatory Skills & Qualifications :
- Experience : 10+ years of professional experience in Site Reliability or DevOps. Must have previously worked as a DevOps Engineer and currently be working as an SRE.
- Cloud Platform : Strong experience in Azure.
- Container Orchestration : Strong experience with AKS (Azure Kubernetes Service) and experience working with Docker.
- Database : Experience working with PostgreSQL (or similar enterprise-grade databases).
- Observability : Strong experience with observability practices and tools (e.g., Azure Monitor, Grafana, Prometheus, App Insights).
- IaC & Automation : Hands-on expertise with Terraform / Bicep and GitOps principles.
Preferred Skills :
- Deep familiarity with Entra ID, Azure Policy, and Key Vault security integration.
- Experience implementing OpenTelemetry standards for distributed tracing.
- Certifications related to Azure or Kubernetes (e.g., Azure Administrator, CKA / CKAD).
(ref : hirist.tech)