Senior Site Reliability Engineer (SRE II)
Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale.
Reports to the Director of SRE.
What you’ll do
SLIs / SLOs & contracts:
Define customer-centric SLIs / SLOs for Tier-0 / Tier-1 services. Publish, review quarterly, and align teams to them.
Error budgeting (policy & tooling):
Run the error-budget policy with multi-window, multi-burn-rate alerts, clear runbooks, and defined paging thresholds.
Gate changes by budget status (freeze / relax rules) wired into CI / CD.
Maintain SLO / EB dashboards (Azure Monitor, Grafana / Prometheus, App Insights). Run weekly SLO reviews with engineering / product.
Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
Incidents without drama:
Lead SEV1 / SEV2 incidents, own comms, run blameless postmortems, and make corrective actions stick.
Engineer reliability in:
Multi-AZ / region patterns (active-active / DR), PDBs / Pod Topology Spread, HPA / VPA / KEDA, resilient rollout / rollback.
AKS at scale:
Harden clusters (network, identity, policy); optimize node / pod density and ingress (AGIC / NGINX); service mesh optional.
Observability that works:
Metrics / traces / logs with Azure Monitor / App Insights, Log Analytics, Prometheus / Grafana, OpenTelemetry. Alert on symptoms, not noise.
IaC & policy:
Terraform / Bicep modules, GitOps (Flux / Argo), policy-as-code (Azure Policy / OPA Gatekeeper). No snowflakes.
CI / CD reliability:
Azure DevOps / GitHub Actions with canary / blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
Capacity & performance:
Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
DR you can trust:
Define RTO / RPO, test backups / restore, run game days / chaos drills, validate ASR and multi-region failover.
Secure by default:
Entra ID (Azure AD), managed identities, Key Vault rotation, VNets / NSGs / Private Link, shift-left checks in CI.
Reduce toil:
Automate recurring ops, build self-service runbooks / chatops, publish golden paths for product teams.
Customer escalations:
Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
Document to scale:
Architectures, runbooks, postmortems, and SLIs / SLOs, all kept current and discoverable.
(If applicable) Streaming / ETL reliability:
Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi / Flink / Kafka / Redpanda data flows.
Minimum qualifications
Bachelor’s in CS / Engineering (or equivalent experience).
12+ years in production ops / platform / SRE, including 5+ years on Azure.
PostgreSQL (must-have):
Deep operational expertise incl. HA / DR, logical / physical replication, performance tuning (indexes / EXPLAIN / ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup / restore testing, and connection pooling (PgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
Azure core:
AKS (must-have); Front Door / App Gateway, API Management, VNets / NSGs / Private Link, Storage, Key Vault, Redis, Service Bus / Event Hubs.
Observability: Azure Monitor / App Insights, Log Analytics, Prometheus / Grafana; SLO design and error-budget operations.
IaC / automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux / Argo). Pipelines in Azure DevOps or GitHub Actions.
Proven incident leadership at scale, blameless postmortems, and SLO / error-budget governance with change gating.
Mentorship and crisp written / verbal communication.
Preferred (nice to have)
Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter / replay patterns.
Azure Solutions Architect Expert, CKA / CKAD.
ITSM (ServiceNow), on-call tooling (PagerDuty / Opsgenie).
Compliance / SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
OpenTelemetry, eBPF tooling, or service mesh.
Multi-tenant SaaS and cost optimization at scale.
Site Reliability Engineer • India