Greetings from Peoplefy!
We’re looking for an SRE who can own reliability for mission-critical services on Azure, shape standards, lead incidents with calm clarity, and drive engineering excellence across teams.
Experience: 10+ years
Location: Trivandrum
Responsibilities:
Strong site reliability experience
Previous experience as a DevOps engineer, currently working as an SRE
Strong experience in Azure
Strong experience with AKS
Experience working with Docker
Experience with observability (any tool)
Experience working with PostgreSQL
SLIs / SLOs & Error Budgets
Define SLIs / SLOs for Tier-0 / Tier-1 services & review quarterly
Implement multi-window, multi-burn-rate alerts
Change gating via CI / CD based on error budgets
Maintain Azure Monitor / Grafana / Prometheus / App Insights dashboards
Conduct weekly SLO reviews & drive reliability roadmap
Incident Management
Lead SEV1 / SEV2 incidents, own communication & postmortems
Ensure corrective actions are implemented
Reliability Engineering
Implement DR, multi-AZ / multi-region patterns, HPA / VPA / KEDA, resilient rollouts
Cluster hardening (network, identity, policy), optimize density
Ingress: AGIC / NGINX
Observability
Metrics, traces, logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, OpenTelemetry
Alerts on symptoms, not noise
Automation & IaC
Terraform / Bicep, GitOps (Flux / Argo), Azure Policy / OPA Gatekeeper
Automate toil & build self-service runbooks / ChatOps
CI / CD Reliability
Azure DevOps / GitHub Actions with canary, blue-green, rollback
Key Vault-backed secrets
Performance & Capacity
Load testing, autoscaling, FinOps collaboration
Disaster Recovery
Define RTO / RPO, run chaos drills & game days
Security
Entra ID, Key Vault rotation, VNets / NSGs, shift-left security in CI
Documentation
Runbooks, SLOs, postmortems, architectures — kept current & accessible
Interested candidates please share your updated resumes on
Senior Site Reliability Engineer • Trivandrum, India