Roles and Responsibilities
- Implement and manage system monitoring solutions to track health, performance, and availability
- Identify and resolve incidents promptly to reduce downtime and customer impact
- Lead incident response efforts and conduct root cause analysis
- Drive continuous improvement initiatives to increase system reliability and maintainability
- Participate in post-mortems and blameless retrospectives
- Collaborate closely with development, operations, and other cross-functional teams
- Maintain configuration management for applications and systems
- Implement comprehensive service monitoring (dashboards, metrics, alerts)
- Define, measure, and meet service level objectives (SLOs) for uptime, performance, and incident rates
- Partner with stakeholders to support high-quality product development and releases
Skills Required
System Monitoring, Incident Response, Continuous Improvement, Configuration Management, Collaboration