Talent.com
Site Reliability Engineer II

Zafin, Trivandrum, Kerala, India
6 hours ago
Job description

Senior Site Reliability Engineer (SRE II)

Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.

What you’ll do

  • SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish, review quarterly, and align teams to them.
  • Error budgeting (policy & tooling):
      • Run the error-budget policy with multi-window, multi-burn-rate alerts; clear runbooks and paging thresholds.
      • Gate changes by budget status (freeze/relax rules) wired into CI/CD.
      • Maintain SLO/EB dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering/product.
      • Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
  • Incidents without drama: Lead SEV1/SEV2, own comms, run blameless postmortems, and make corrective actions stick.
  • Engineer reliability in: Multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
  • AKS at scale: Harden clusters (network, identity, policy), optimize node/pod density, ingress (AGIC/Nginx); mesh optional.
  • Observability that works: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
  • IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
  • CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
  • Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
  • DR you can trust: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
  • Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
  • Reduce toil: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
  • Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
  • Document to scale: Architectures, runbooks, postmortems, SLIs/SLOs, all kept current and discoverable.
  • (If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.

Minimum qualifications

  • Bachelor’s in CS/Engineering (or equivalent experience).
  • 12+ years in production ops/platform/SRE, including 5+ years on Azure.
  • PostgreSQL (must-have): Deep operational expertise incl. HA/DR, logical/physical replication, performance tuning (indexes/EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (pgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
  • Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
  • Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
  • IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
  • Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
  • Mentorship and crisp written/verbal communication.

Preferred (nice to have)

  • Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
  • Azure Solutions Architect Expert, CKA/CKAD.
  • ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
  • Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
  • OpenTelemetry, eBPF tooling, or service mesh.
  • Multi-tenant SaaS and cost optimization at scale.