Description :
Role : Site Reliability Engineer (SRE)
Location : Remote-First (Bangalore) (Hybrid : Rare Office Attendance Only)
Work Mode : Permanent Night Shift
Experience Required : 6+ Years
Position Type : Individual Contributor
Role Summary :
We are seeking a Site Reliability Engineer (SRE) with 6+ years of experience to support the mission-critical operations of a large US banking client. This is a night shift role designed for proactive engineers who can combine production support with hands-on engineering and automation. You will monitor systems using observability tools, handle incidents, and collaborate closely with developers to ensure SLAs / SLOs are met.
You will not just respond to alerts, but actively improve reliability, automate repetitive tasks, and stabilize services. This is a remote-first role with rare office attendance (e.g., BCP drills or infra escalations). Candidates must be self-driven, capable of deep technical diagnosis, and comfortable with distributed collaboration. Note : This is a permanent night shift role. Candidates should confirm night shift readiness.
Must-Have Skills & Depth :
- Monitoring & Observability - Must have configured or extensively used dashboards in tools like Grafana, Prometheus, ELK, or GCP Stackdriver to monitor availability, latency, and errors. Must be able to define alert thresholds, interpret log patterns, and correlate multi-source metrics. Knowing the above-mentioned tools is not necessary, and knowledge
- Incident Management - Must have handled P1 / P2 incidents end-to-end : alert triage, participation in bridge calls, creating / updating Jira / SNOW tickets, stakeholder communication, and drafting RCA / post-mortem reports.
- Automation & Scripting - Must have created or enhanced scripts in Bash or Python to automate health checks, alert suppression logic, log ingestion, or recovery steps. Should be able to write modular, reusable scripts.
- Cloud Platform Experience (GCP Preferred) - Must have worked on production-grade services on at least one cloud platform (GCP preferred). Should have experience using cloud-native monitoring and logging (e.g., Stackdriver, CloudWatch) and performing basic resource inspection (VMs, storage, network health).
- SLI / SLO Awareness - Should have tracked SLO breaches using observability platforms, with exposure to error budgets, latency SLIs, and availability metrics. Not required to define SLIs / SLOs independently, but must be able to interpret breaches.
- Java Runtime Awareness - Must be able to analyze JVM logs (GC pauses, memory issues, thread deadlocks). Not required to perform tuning, but must detect symptoms and raise root cause hypotheses.
- DB Performance Triage (Oracle / Postgres) - Must be able to spot DB-related errors or latency issues from logs / alerts. Not expected to tune queries, but must collaborate with DBAs / devs using evidence from logs or APM.
- Dev Collaboration - Should have participated in daily ops / dev stand-ups, escalations, or RCA calls, contributing production context to code fixes or config changes. Strong communication expected.
- Night Shift Readiness - Full alignment to US working hours is mandatory. Shift is fixed and non-rotational. Candidate must have experience working night shifts or must explicitly confirm readiness.
Nice-to-Have Skills :
- Change Management - Exposure to ServiceNow, CAB processes, or deployment planning. Familiarity with structured release windows and rollback protocols.
- Capacity Planning - Assisted in planning infra scale-up / down based on usage trends, using monitoring tools or dashboards (CPU, memory, traffic alerts).
- CI / CD Integration - Familiar with embedding health checks, smoke tests, or SRE gates in Jenkins, GitHub Actions, or other pipelines.
- Error Budget Automation - Exposure to setting or consuming automated alerts when services cross SLO budgets using tools like SLO Generator, Datadog, or custom scripts.
- Terraform / IaC (Optional) - Able to read and interpret Terraform scripts, especially for monitoring agent deployment or alert rule provisioning. Not required to write from scratch.
- GCP Certification - GCP Associate Cloud Engineer or similar cert is a plus, not mandatory.
(ref : hirist.tech)