Job Summary :
We are seeking a visionary and strategic VP of Site Reliability Engineering (SRE) to join the leadership team. This is a foundational role within the CTO organization, where you will play a key part in shaping the company's SRE function and culture.
The ideal candidate will be a senior leader with a deep understanding of SRE principles and a proven ability to drive large-scale automation and operational engineering across the organization.
This position requires a professional who can blend strategic direction with hands-on expertise to ensure the reliability, performance, and scalability of both on-premises and public cloud platforms.
Key Responsibilities :
Strategic Leadership & Implementation :
- Define, drive, and implement a comprehensive SRE strategy that aligns with the organization's business goals. This includes evangelizing SRE principles and methodologies across engineering and operations teams.
Operational Excellence & Automation :
- Promote an "Automate-first" culture by developing and implementing methodologies to identify and eliminate manual toil, inefficiency, and redundancy in operational processes.
Service Metrics & SLOs :
- Develop and implement engineering and operational service metrics (SLOs) for critical services. You will be responsible for defining and operating to Error Budgets with actionable plans to improve operational efficiency and enhance service quality.
Reliability Engineering :
- Design and implement reliability improvements and lead architectural reviews focused on resilience. You will also be responsible for conducting capacity planning and chaos engineering exercises to proactively identify and mitigate system weaknesses.
Observability & Monitoring :
- Define and execute a modern monitoring strategy. You must have implemented and operated a wide range of observability technologies for enterprise-grade production systems, with experience in tools like OpenTelemetry, Prometheus, and Grafana.
Continuous Improvement :
- Drive a culture of continuous improvement by leading post-incident reviews and ensuring that key learnings are translated into actionable changes across the SDLC.
Deployment & Processes :
- Contribute to the overall test and deployment processes, ensuring they are as reliable and automated as possible through practices like Infrastructure as Code (IaC) and Configuration Management.
Required Skills & Qualifications :
- 10+ years of hands-on experience in a dedicated SRE role, with a strong background in software development.
- A bachelor's degree or higher in Computer Science, Information Systems, or a related field, or equivalent work experience.
- Practical experience in defining and implementing Service Level Objectives (SLOs) and operating to Error Budgets.
- A comprehensive understanding of SRE principles and the ability to evangelize them across a large organization.
- Extensive knowledge of Infrastructure as Code (IaC) principles and design, with proficiency in Configuration Management solutions such as Ansible, Chef, or Puppet.
- Expertise in modern observability tooling, including OpenTelemetry, Prometheus, Grafana, and associated projects.
- Strong analytical skills and a solid understanding of all critical production support processes.
- 5+ years of hands-on experience with one or more public/private cloud platforms (e.g., AWS, Azure).
(ref : hirist.tech)