Senior Site Reliability Engineer - Azure Kubernetes Service

Peoplefy, Trivandrum

2 days ago

Job description

Site Reliability Engineer (SRE) - Azure / AKS Lead

Role Overview :

This is a senior technical leadership role for a Site Reliability Engineer (SRE) requiring 10+ years of experience, focused on owning and driving reliability for mission-critical, high-scale services deployed on Microsoft Azure.

The role demands prior experience as a DevOps Engineer who has since moved into a dedicated SRE function. The incumbent must possess expert knowledge of Azure, AKS (Azure Kubernetes Service), and modern reliability practices, including defining and enforcing SLIs / SLOs.

Based in Trivandrum, this SRE will shape technical standards, lead major incident response, and champion engineering excellence across multiple development teams.

Job Summary :

We are seeking an experienced SRE Lead (10+ years) with a strong background in Azure and AKS to ensure the highest levels of availability, performance, and scalability for our Tier-0 / Tier-1 services.

This role is responsible for establishing and maintaining core SRE practices, including defining error budgets, implementing multi-burn-rate alerting, driving continuous automation (Terraform / GitOps), and leading critical incident response with calm clarity. Expertise in observability, disaster recovery design (RTO / RPO), and cluster hardening is mandatory.

Key Responsibilities and Reliability Engineering Deliverables :

Service Level Management : Define SLIs / SLOs for Tier-0 / Tier-1 services and conduct quarterly reviews. Implement multi-window, multi-burn-rate alerts to precisely detect evolving service degradation.
Error Budgeting and Change Gating : Enforce reliability constraints by implementing change gating in CI / CD based on error budgets (using tools like Azure DevOps / GitHub Actions). Conduct weekly SLO reviews & drive the reliability roadmap.
Incident Management Command : Lead SEV1 / SEV2 incidents as the Incident Commander, taking ownership of rapid resolution, clear communication & postmortems. Ensure all corrective actions are implemented effectively.
Reliability Architecture & Kubernetes : Design and implement robust reliability patterns including DR (Disaster Recovery), multi-AZ / region configurations, HPA / VPA / KEDA for optimized scaling, and resilient deployment strategies like canary, blue-green, and rollback.
Cluster Hardening & Optimization : Drive cluster hardening initiatives (network, identity, policy). Optimize resource utilization and service density. Manage ingress traffic using AGIC / Nginx.
Observability Implementation : Implement comprehensive observability solutions covering metrics, traces, and logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, and OpenTelemetry. Ensure alerts fire on symptoms, not noise.
Automation and Infrastructure as Code (IaC) : Automate platform provisioning using Terraform / Bicep. Implement GitOps (Flux / Argo) principles for deployment management and enforce compliance using Azure Policy / OPA Gatekeeper. Automate toil & build self-service runbooks / chatops.
Performance & Capacity Planning : Conduct rigorous load testing. Optimize platform autoscaling strategies and collaborate with FinOps to optimize cloud cost.
Disaster Recovery and Testing : Define RTO / RPO objectives. Ensure compliance by executing regular chaos drills & game days to validate resilience.
Security and Governance : Implement security best practices leveraging Entra ID (Azure AD), Key Vault rotation, and VNets / NSGs, and drive shift-left security practices within the CI pipeline.
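
For candidates unfamiliar with the terms, the multi-window, multi-burn-rate alerting and error-budget change gating named above can be sketched roughly as follows. This is a minimal illustration, not a description of the employer's tooling: the names BurnRateRule, should_alert, and deploy_allowed are hypothetical, and the 14.4 threshold with 1h / 5m windows is a commonly cited example value for a 30-day SLO, assumed here for demonstration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One alerting rule: both windows must exceed the threshold (hypothetical type)."""
    long_window: str   # e.g. "1h"; confirms the problem is sustained
    short_window: str  # e.g. "5m"; confirms the problem is still happening
    threshold: float   # burn-rate multiple that triggers the alert


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    A value of 1.0 means the budget would last exactly one SLO period.
    """
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_alert(rule: BurnRateRule, error_ratios: dict, slo_target: float) -> bool:
    """Fire only when BOTH the long and the short window burn too fast."""
    long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
    short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
    return long_burn >= rule.threshold and short_burn >= rule.threshold


def deploy_allowed(period_error_ratio: float, slo_target: float,
                   min_remaining: float = 0.0) -> bool:
    """Change gate: allow a rollout only while error budget remains.

    period_error_ratio is the error ratio measured over the SLO period so far.
    """
    remaining = 1.0 - burn_rate(period_error_ratio, slo_target)
    return remaining > min_remaining


if __name__ == "__main__":
    page_rule = BurnRateRule(long_window="1h", short_window="5m", threshold=14.4)
    observed = {"1h": 0.02, "5m": 0.03}  # assumed sample error ratios
    print("page on-call:", should_alert(page_rule, observed, slo_target=0.999))
    print("deploy allowed:", deploy_allowed(0.0005, slo_target=0.999))
```

In a CI / CD pipeline, a check like deploy_allowed would typically run as a pre-deployment step that fails the job when the budget is exhausted; the error ratios themselves would come from the observability stack rather than from hard-coded values.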

Mandatory Skills & Qualifications :

Experience : 10+ years of professional experience in Site Reliability or DevOps. Must have previously worked as a DevOps Engineer and currently be working as an SRE.

Cloud Platform : Strong experience in Azure.

Container Orchestration : Strong experience with AKS (Azure Kubernetes Service) and experience working with Docker.

Database : Experience working on PostgreSQL (or similar enterprise-grade databases).

Observability : Strong experience with observability practices and tools (e.g., Azure Monitor, Grafana, Prometheus, App Insights).

IaC & Automation : Hands-on expertise with Terraform / Bicep and GitOps principles.

Preferred Skills :

Deep familiarity with Entra ID, Azure Policy, and Key Vault security integration.

Experience implementing OpenTelemetry standards for distributed tracing.

Certifications related to Azure or Kubernetes (e.g., Azure Administrator, CKA / CKAD).

(ref : hirist.tech)
