Senior Site Reliability Engineer (SRE II)
Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.
What you’ll do
- SLIs / SLOs & contracts: Define customer-centric SLIs / SLOs for Tier-0 / Tier-1 services. Publish, review quarterly, and align teams to them.
- Error budgeting (policy & tooling):
  - Run the error-budget policy with multi-window, multi-burn-rate alerts, clear runbooks, and paging thresholds.
  - Gate changes by budget status (freeze / relax rules) wired into CI / CD.
  - Maintain SLO / EB dashboards (Azure Monitor, Grafana / Prometheus, App Insights). Run weekly SLO reviews with engineering / product.
  - Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
- Incidents without drama: Lead SEV1 / SEV2, own comms, run blameless postmortems, and make corrective actions stick.
- Engineer reliability in: Multi-AZ / region patterns (active-active / DR), PDBs / Pod Topology Spread, HPA / VPA / KEDA, resilient rollout / rollback.
- AKS at scale: Harden clusters (network, identity, policy), optimize node / pod density, ingress (AGIC / Nginx); mesh optional.
- Observability that works: Metrics / traces / logs with Azure Monitor / App Insights, Log Analytics, Prometheus / Grafana, OpenTelemetry. Alert on symptoms, not noise.
- IaC & policy: Terraform / Bicep modules, GitOps (Flux / Argo), policy-as-code (Azure Policy / OPA Gatekeeper). No snowflakes.
- CI / CD reliability: Azure DevOps / GitHub Actions with canary / blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
- Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
- DR you can trust: Define RTO / RPO, test backups / restore, run game days / chaos drills, validate ASR and multi-region failover.
- Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets / NSGs / Private Link, shift-left checks in CI.
- Reduce toil: Automate recurring ops, build self-service runbooks / chatops, publish golden paths for product teams.
- Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
- Document to scale: Architectures, runbooks, postmortems, SLIs / SLOs, all kept current and discoverable.
- (If applicable) Streaming / ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi / Flink / Kafka / Redpanda data flows.
Minimum qualifications
- Bachelor’s in CS / Engineering (or equivalent experience).
- 12+ years in production ops / platform / SRE, including 5+ years on Azure.
- PostgreSQL (must-have): Deep operational expertise incl. HA / DR, logical / physical replication, performance tuning (indexes / EXPLAIN / ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup / restore testing, and connection pooling (pgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
- Azure core: AKS (must-have); Front Door / App Gateway, API Management, VNets / NSGs / Private Link, Storage, Key Vault, Redis, Service Bus / Event Hubs.
- Observability: Azure Monitor / App Insights, Log Analytics, Prometheus / Grafana; SLO design and error-budget operations.
- IaC / automation: Terraform and / or Bicep; PowerShell and Python; GitOps (Flux / Argo). Pipelines in Azure DevOps or GitHub Actions.
- Proven incident leadership at scale, blameless postmortems, and SLO / error-budget governance with change gating.
- Mentorship and crisp written / verbal communication.
Preferred (nice to have)
- Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter / replay patterns.
- Azure Solutions Architect Expert, CKA / CKAD.
- ITSM (ServiceNow), on-call tooling (PagerDuty / Opsgenie).
- Compliance / SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
- OpenTelemetry, eBPF tooling, or service mesh.
- Multi-tenant SaaS and cost optimization at scale.