Senior Site Reliability Engineer - Azure Kubernetes Service

Peoplefy, Trivandrum
2 days ago
Job description

Site Reliability Engineer (SRE) - Azure / AKS Lead

Role Overview :

This is a senior technical leadership role for a Site Reliability Engineer (SRE) requiring 10+ years of experience, focused on owning and driving reliability for mission-critical, high-scale services deployed on Microsoft Azure.

The role requires prior experience as a DevOps Engineer followed by a move into a dedicated SRE function. The candidate must have expert knowledge of Azure, AKS (Azure Kubernetes Service), and modern reliability practices, including defining and enforcing SLIs / SLOs.

Based in Trivandrum, this SRE will shape technical standards, lead major incident response, and champion engineering excellence across multiple development teams.

Job Summary :

We are seeking an experienced SRE Lead (10+ years) with a strong background in Azure and AKS to ensure the highest levels of availability, performance, and scalability for our Tier-0 / Tier-1 services.

This role is responsible for establishing and maintaining core SRE practices, including defining error budgets, implementing multi-burn-rate alerting, driving continuous automation (Terraform / GitOps), and leading critical incident response calmly and clearly. Expertise in observability, disaster recovery design (RTO / RPO), and cluster hardening is mandatory.

Key Responsibilities and Reliability Engineering Deliverables :

  • Service Level Management : Define SLIs / SLOs for Tier-0 / Tier-1 services and conduct quarterly reviews. Implement multi-window, multi-burn-rate alerts to precisely detect evolving service degradation.
  • Error Budgeting and Change Gating : Enforce reliability constraints by implementing change gating in CI / CD based on error budgets (using tools like Azure DevOps / GitHub Actions). Conduct weekly SLO reviews and drive the reliability roadmap.
  • Incident Management Command : Lead SEV1 / SEV2 incidents as the Incident Commander, taking ownership of rapid resolution, clear communication & postmortems. Ensure all corrective actions are implemented effectively.
  • Reliability Architecture & Kubernetes : Design and implement robust reliability patterns, including DR (Disaster Recovery), multi-AZ / multi-region configurations, HPA / VPA / KEDA for optimized scaling, and resilient deployment strategies such as canary, blue-green, and rollback.
  • Cluster Hardening & Optimization : Drive cluster hardening initiatives (network, identity, policy). Optimize resource utilization and service density. Manage ingress traffic using AGIC / NGINX.
  • Observability Implementation : Implement comprehensive observability solutions covering metrics, traces, and logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, and OpenTelemetry. Ensure alerts fire on symptoms, not noise.
  • Automation and Infrastructure as Code (IaC) : Automate platform provisioning using Terraform / Bicep. Implement GitOps (Flux / Argo) principles for deployment management and enforce compliance using Azure Policy / OPA Gatekeeper. Automate toil and build self-service runbooks / ChatOps.
  • Performance & Capacity Planning : Conduct rigorous load testing. Optimize platform autoscaling strategies and collaborate with FinOps to optimize cloud cost.
  • Disaster Recovery and Testing : Define RTO / RPO objectives. Ensure compliance by executing regular chaos drills & game days to validate resilience.
  • Security and Governance : Implement security best practices leveraging Entra ID (Azure AD), Key Vault rotation, and VNets / NSGs, and drive shift-left security practices within the CI pipeline.

Mandatory Skills & Qualifications :

  • Experience : 10+ years of professional experience in Site Reliability or DevOps. Must have previously worked as a DevOps Engineer and currently be working as an SRE.
  • Cloud Platform : Strong experience in Azure.
  • Container Orchestration : Strong experience with AKS (Azure Kubernetes Service) and experience working with Docker.
  • Database : Experience working with PostgreSQL (or similar enterprise-grade databases).
  • Observability : Strong experience with observability practices and tools (e.g., Azure Monitor, Grafana, Prometheus, App Insights).
  • IaC & Automation : Hands-on expertise with Terraform / Bicep and GitOps principles.

Preferred Skills :

  • Deep familiarity with Entra ID, Azure Policy, and Key Vault security integration.
  • Experience implementing OpenTelemetry standards for distributed tracing.
  • Certifications related to Azure or Kubernetes (e.g., Azure Administrator, CKA / CKAD).

(ref : hirist.tech)
