Talent.com
Site Reliability Engineer

iTheme Consulting Pvt Ltd, Delhi, IN
15 hours ago
Job type
  • Remote
Job description

Site Reliability Engineer (SRE)

We are hiring a Site Reliability Engineer (SRE) to support the night-time operations of a mission-critical banking platform for a US-based enterprise client. This is a permanent night shift role tailored for experienced engineers who thrive in production environments and bring a proactive approach to incident resolution and automation.

You will work on system monitoring, incident response, and platform stability, while also improving observability, creating automation scripts, and collaborating with developers and DevOps teams. You won't just respond to alerts; you'll help prevent them.

Work Mode : Permanent Night Shift

Note : This is a fixed night shift role. Candidates must have prior night shift experience or explicitly confirm readiness for permanent US time zone shifts.

Key Responsibilities :

  • Monitor system health, SLIs / SLOs, and infrastructure using tools such as Prometheus, Grafana, ELK, and Stackdriver.
  • Lead incident triage for P1 / P2 alerts, engage in war rooms, update tickets (JIRA / ServiceNow), and participate in post-incident RCA documentation.
  • Create or enhance automation scripts (Bash / Python) for log ingestion, alert suppression, auto-recovery, and health checks.
  • Analyze application runtime issues (using JVM logs, memory usage, GC pauses, or thread deadlocks as evidence) to support root cause analysis.
  • Participate in daily DevOps / SRE standups, collaborating closely with engineering teams to improve production reliability.
  • Handle database performance alerts (Oracle / Postgres) and collaborate with DBAs or developers to resolve backend bottlenecks.
  • Track and interpret SLO breaches, availability metrics, and system latencies to enforce production SLAs.

Core Skills & Expertise :

Technical Skills :

  • Experience with Grafana, Prometheus, ELK Stack, or Stackdriver. Able to define alerts, read logs, and correlate cross-system issues.
  • Full ownership of P1 / P2 incidents, including triage, ticketing, stakeholder communication, and RCA participation.
  • Proficient in Bash or Python scripting to automate routine SRE tasks and recovery workflows.
  • Experience managing production workloads on GCP, AWS, or Azure, with ability to inspect cloud logs, VM status, networking, and storage configurations.
  • Familiar with concepts like error budgets, latency thresholds, and SLO tracking. Capable of interpreting breaches and reporting anomalies.
  • Able to spot symptoms of JVM issues such as GC pauses, memory leaks, and thread contention, and to raise appropriate diagnostics.
  • Identify backend delays or errors from logs and assist in pinpointing query or connection-related issues.
  • Strong communication skills to work with distributed teams during escalations, code fixes, or configuration changes.
  • Must be fully aligned to a permanent night shift (US time) and self-sufficient in a remote-first environment.

Nice-to-Have Skills :

  • Familiarity with ServiceNow, change advisory boards, rollback planning, and structured release processes.
  • Experience monitoring CPU, memory, and traffic metrics to recommend infrastructure scale-up / down strategies.
  • Exposure to embedding SRE gates, smoke tests, or health validations in CI pipelines like Jenkins or GitHub Actions.
  • Basic understanding of tools like SLO Generator or Datadog for automated budget tracking and alerting.
  • Can interpret Terraform code related to monitoring, infrastructure, or alert rules. Not required to author full modules.
  • Holding a GCP Associate Cloud Engineer or similar certification is a plus but not mandatory.

(ref : hirist.tech)
