Responsibilities
- Diagnose and triage technical issues; resolve incidents where possible.
- Collaborate with product owners to validate runbooks, monitoring dashboards, SLAs, and notification processes.
- Handle issues escalated by Site Reliability Engineers (SREs) and address their root causes.
- Improve system observability through enhancements to logging, monitoring, and automated tooling.
- Build or enhance automation for compliance detection and resource efficiency.
- Refactor product code for maintainability and modularity; ensure code is developer-friendly.
- Participate in peer code reviews, sharing tools, troubleshooting techniques, and best practices.
- Join on-call rotations to support high system uptime and quick incident resolution.
- Act as a technical leader, advising on how to structure SRE work for maximum business value.
Skills Required
IBP, ServiceNow, Supply Chain Operations, Incident Management, Root Cause Analysis