Job Description
What You'll Do
- Work with development teams to troubleshoot and resolve issues, minimizing customer impact.
- Develop and maintain automated runbooks to manage issues proactively.
- Apply engineering principles and automation to enhance our operating environments.
- Monitor and improve the reliability and performance of applications on the Guidewire Cloud Platform.
- Use your software engineering expertise to optimize systems and reduce manual toil.
- Document incidents and develop processes to prevent future occurrences.
- Stay current with industry trends, tools, and best practices in site reliability engineering.
- Foster a culture of innovation, learning, and continuous improvement.
- Participate in on-call rotations to ensure the availability and reliability of our services.
What You'll Bring
- Experience as an SRE or in a similar role, with a focus on improving system reliability.
- Strong problem-solving skills and the ability to analyze complex systems and devise effective solutions.
- Effective collaboration and communication skills to work cross-functionally and document processes clearly.
- Experience with automation, monitoring, and performance optimization tools and techniques.
- Commitment to maximizing uptime and scalability and to delivering an exceptional end-user experience.
- Passion for technology and a desire to continuously learn and grow your skills.
- Alignment with Guidewire's mission to leverage technology to help protect and support others.
Required Skills:
- Experience designing and implementing SLIs, SLOs, and error budgets.
- Familiarity with application performance monitoring (APM) and telemetry tools to maintain expected service levels for applications.
- Proficiency with Linux system administration and the ability to program or script using Python, Go, Java, shell, or equivalent.
- Experience troubleshooting and debugging distributed systems on cloud infrastructure.
- Experience with CI/CD pipelines within Kubernetes and legacy ecosystems.
- Experience creating monitors, dashboards, and synthetic transactions in monitoring tools like Datadog.
- Experience deploying and managing scalable infrastructure within AWS and Kubernetes ecosystems using Terraform and other cloud-native approaches.
Skills Required
GitHub, SAML, PostgreSQL, Python, AWS