Site Reliability Engineer (SRE)
We are hiring a Site Reliability Engineer (SRE) to support the night-time operations of a mission-critical banking platform for a US-based enterprise client. This is a permanent night shift role tailored for experienced engineers who thrive in production environments and bring a proactive approach to incident resolution and automation.
You will work on system monitoring, incident response, and platform stability, while also improving observability, creating automation scripts, and collaborating with developers and DevOps teams. You won't just respond to alerts; you'll help prevent them.
Work Mode : Permanent Night Shift
Note : This is a fixed night shift role. Candidates must have prior night shift experience or explicitly confirm readiness for permanent shifts aligned to US time zones.
Key Responsibilities :
- Monitor system health, SLIs / SLOs, and infrastructure using tools like Prometheus, Grafana, ELK, Stackdriver, etc.
- Lead incident triage for P1 / P2 alerts, engage in war rooms, update tickets (JIRA / SNOW), and participate in post-incident RCA documentation.
- Create or enhance automation scripts (Bash / Python) for log ingestion, alert suppression, auto-recovery, and health checks.
- Analyze application runtime issues (such as JVM logs, memory usage, GC pauses, or thread deadlocks) to support root cause analysis.
- Participate in daily DevOps / SRE standups, collaborating closely with engineering teams to improve production reliability.
- Handle database performance alerts (Oracle / Postgres) and collaborate with DBAs or developers to resolve backend bottlenecks.
- Track and interpret SLO breaches, availability metrics, and system latencies to enforce production SLAs.
Core Skills & Expertise :
Technical Skills :
- Experience with Grafana, Prometheus, ELK Stack, or Stackdriver. Able to define alerts, read logs, and correlate cross-system issues.
- Full ownership of P1 / P2 incidents, including triage, ticketing, stakeholder communication, and RCA participation.
- Proficient in Bash or Python scripting to automate routine SRE tasks and recovery workflows.
- Experience managing production workloads on GCP, AWS, or Azure, with the ability to inspect cloud logs, VM status, networking, and storage configurations.
- Familiar with concepts like error budgets, latency thresholds, and SLO tracking. Capable of interpreting breaches and reporting anomalies.
- Able to spot symptoms of JVM issues such as GC pauses, memory leaks, and thread contention, and raise appropriate diagnostics.
- Able to identify backend delays or errors from logs and assist in pinpointing query or connection-related issues.
- Strong communication skills to work with distributed teams during escalations, code fixes, or configuration changes.
- Must be fully aligned to a permanent night shift (US time) and self-sufficient in a remote-first environment.
Nice-to-Have Skills :
- Familiarity with ServiceNow, change advisory boards, rollback planning, and structured release processes.
- Experience monitoring CPU, memory, and traffic metrics to recommend infrastructure scale-up / down strategies.
- Exposure to embedding SRE gates, smoke tests, or health validations in CI pipelines like Jenkins or GitHub Actions.
- Basic understanding of tools like SLO Generator or Datadog for automated budget tracking and alerting.
- Can interpret Terraform code related to monitoring, infrastructure, or alert rules. Not required to author full modules.
- Holding a GCP Associate Cloud Engineer or similar certification is a plus but not mandatory.
(ref : hirist.tech)