Responsibilities:
- Oversee the platform's stability and performance as part of a 24/7 NOC team, and monitor production releases based on complexity and risk assessment.
- Build, manage, and scale reliable cloud production services to enhance operational efficiency.
- Design, develop, and improve end-to-end reliability and maintainability for all SailPoint SaaS services.
- Coach engineering teams on observability best practices and on defining Service Level Objectives (SLOs).
- Lead post-incident reviews and define effective preventive actions.
- Collaborate with developers to improve system reliability through embedding programs.
- Provide guidance, best practices, and support as part of an SRE Centre of Excellence.
- Manage cross-functional requirements by working with Engineering, Product, Services, and other teams.
- Develop and implement automation tools and processes to streamline operations and enhance system performance.
- Mentor team members on design reviews, code, test cases, automation, observability, root cause analysis, and self-healing.
- Drive operational excellence for frictionless operations, optimal on-call performance, and enhanced customer experience.
Skills Required:
Automation, Linux, Networking, Debugging, Agile, Incident Management