About the Role
We are looking for a skilled Site Reliability Engineer II to join our SRE team. The ideal candidate will have hands-on experience in production monitoring, alert handling, and L1 production support. You will play a key role in ensuring the reliability, availability, and performance of our production systems.
Key Responsibilities
- Monitor production systems using enterprise monitoring tools and dashboards.
- Respond to alerts promptly and take appropriate first-level actions.
- Provide L1 production support, including initial triage, log analysis, and escalation to relevant teams as needed.
- Participate in incident management, including documentation, communication, and coordination during production incidents.
- Perform basic troubleshooting for application, infrastructure, and platform issues.
- Ensure adherence to SLAs, SLOs, and operational best practices.
- Contribute to runbooks, knowledge base articles, and incident postmortems.
- Collaborate with engineering and DevOps teams for incident resolution and improvements.
- Participate in on-call rotations as required.
Required Skills & Qualifications
- 2–5 years of experience in SRE, Production Support, DevOps, or similar roles.
- Hands-on experience with production monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic, Splunk, CloudWatch).
- Strong understanding of alerting systems, incident lifecycle, and on-call processes.
- Basic troubleshooting knowledge in Linux/Unix, networking fundamentals, and cloud environments.
- Familiarity with logging tools (e.g., ELK, Splunk, Cloud Logging).
- Ability to communicate clearly during incidents and coordinate with cross-functional teams.
- Strong analytical, problem-solving, and time-management skills.
Good to Have
- Experience with cloud platforms (AWS/Azure/GCP).
- Basic scripting skills (Python, Shell, Bash).
- Exposure to CI/CD pipelines and DevOps practices.
- Understanding of SLOs, SLIs, and reliability engineering principles.