Greetings from Peoplefy!
We’re looking for an SRE who can own reliability for mission-critical services on Azure, shape standards, lead incidents with calm clarity, and drive engineering excellence across teams.
Experience: 10+ years
Location: Trivandrum
Requirements:
- Strong site reliability experience
- Previously worked as a DevOps engineer and currently working as an SRE
- Strong experience in Azure
- Strong experience with AKS
- Experience working with Docker
- Experience with observability (any tool)
- Experience working with PostgreSQL
SLIs / SLOs & Error Budgets
- Define SLIs / SLOs for Tier-0 / Tier-1 services & review quarterly
- Implement multi-window, multi-burn-rate alerts
- Change gating via CI / CD based on error budgets
- Maintain Azure Monitor / Grafana / Prometheus / App Insights dashboards
- Conduct weekly SLO reviews & drive reliability roadmap
Incident Management
- Lead SEV1 / SEV2 incidents, own communication & postmortems
- Ensure corrective actions are implemented
Reliability Engineering
- Implement DR, multi-AZ / region patterns, HPA / VPA / KEDA, resilient rollouts
- Cluster hardening (network, identity, policy), optimize density
- Ingress: AGIC / Nginx
Observability
- Metrics, traces, logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, OpenTelemetry
- Alerts on symptoms, not noise
Automation & IaC
- Terraform / Bicep, GitOps (Flux / Argo), Azure Policy / OPA Gatekeeper
- Automate toil & build self-service runbooks / ChatOps
CI / CD Reliability
- Azure DevOps / GitHub Actions with canary, blue-green, rollback
- Key Vault-backed secrets
Performance & Capacity
- Load testing, autoscaling, FinOps collaboration
Disaster Recovery
- Define RTO / RPO, run chaos drills & game days
Security
- Entra ID, Key Vault rotation, VNets / NSGs, shift-left security in CI
Documentation
- Runbooks, SLOs, postmortems, architectures, all kept current & accessible
Interested candidates, please share your updated resume at amruta.bu@peoplefy.com