Site Reliability Engineer II

Zafin | Trivandrum, Kerala, India
Posted 16 days ago
Job description

Senior Site Reliability Engineer (SRE II)

Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.

What you’ll do

  • SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish them, review them quarterly, and align teams to them.
  • Error budgeting (policy & tooling):
      • Run the error-budget policy with multi-window, multi-burn-rate alerts, clear runbooks, and defined paging thresholds.
      • Gate changes on budget status (freeze/relax rules) wired into CI/CD.
      • Maintain SLO/error-budget dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering and product.
      • Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
  • Incidents without drama: Lead SEV1/SEV2 incidents, own comms, run blameless postmortems, and make corrective actions stick.
  • Engineer reliability in: Multi-AZ and multi-region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, and resilient rollout/rollback.
  • AKS at scale: Harden clusters (network, identity, policy), optimize node/pod density, and run ingress (AGIC/Nginx); a service mesh is optional.
  • Observability that works: Metrics, traces, and logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, and OpenTelemetry. Alert on symptoms, not noise.
  • IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), and policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
  • CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green releases, progressive delivery, auto-rollback, and Key Vault-backed secrets.
  • Capacity & performance: Load testing, right-sizing, and autoscaling; partner with FinOps to reduce spend without hurting SLOs.
  • DR you can trust: Define RTO/RPO, test backup and restore, run game days and chaos drills, and validate Azure Site Recovery (ASR) and multi-region failover.
  • Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, and shift-left checks in CI.
  • Reduce toil: Automate recurring ops, build self-service runbooks and ChatOps, and publish golden paths for product teams.
  • Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
  • Document to scale: Keep architectures, runbooks, postmortems, and SLIs/SLOs current and discoverable.
  • (If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.

Minimum qualifications

  • Bachelor’s in CS/Engineering (or equivalent experience).
  • 12+ years in production ops/platform/SRE, including 5+ years on Azure.
  • PostgreSQL (must-have): Deep operational expertise, including HA/DR, logical/physical replication, performance tuning (indexes, EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (PgBouncer). Experience with Azure Database for PostgreSQL – Flexible Server is preferred.
  • Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
  • Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
  • IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo); pipelines in Azure DevOps or GitHub Actions.
  • Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
  • Mentorship and crisp written and verbal communication.

Preferred (nice to have)

  • Apache NiFi, Apache Flink, Apache Kafka, or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, and dead-letter/replay patterns.
  • Azure Solutions Architect Expert, CKA/CKAD.
  • ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
  • Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
  • OpenTelemetry, eBPF tooling, or service mesh.
  • Multi-tenant SaaS and cost optimization at scale.