Join us in bringing joy to customer experience. Five9 is a leading provider of cloud contact center software, bringing the power of cloud innovation to customers worldwide.
Living our values every day results in our team-first culture and enables us to innovate, grow, and thrive while enjoying the journey together. We celebrate diversity and foster an inclusive environment, empowering our employees to be their authentic selves.
Director of Site Reliability Engineering
The Director of Site Reliability Engineering is responsible for leading the strategic vision, operational excellence, and organizational capability of our SRE function. This role combines technical leadership with people management to build and scale a world-class SRE organization that enables rapid innovation while maintaining exceptional reliability standards.
As the senior leader of the SRE discipline, you will establish the technical strategy, culture, and practices that ensure our systems can scale reliably to meet business demands. You will build and lead a team of SRE professionals, partner with engineering leadership across the organization, and drive the adoption of SRE principles and practices.
This is a hands-on leadership role requiring deep technical expertise, proven ability to scale engineering organizations, and a track record of building reliable systems at scale. The ideal candidate will balance reliability with tactical execution, driving both immediate operational excellence and long-term architectural improvements where necessary.
Key Responsibilities
Strategic Leadership & Vision
- Define and execute the long-term SRE strategy aligned with business objectives and technical roadmap
- Establish reliability standards, SLI/SLO frameworks, and error budget policies across services
- Drive architectural decisions that improve system reliability, scalability, and operational efficiency
- Partner with engineering leadership to influence platform and application design for reliability
- Represent the SRE perspective in executive technical discussions and strategic planning
Team Leadership & Development
- Build, lead, and scale a high-performing SRE organization
- Recruit, hire, and onboard top-tier SRE talent across multiple experience levels
- Develop career progression frameworks and growth paths for SRE professionals
- Foster a culture of continuous learning, blameless post-mortems, and operational excellence
- Provide technical mentorship and leadership development for senior SRE staff
Operational Excellence & Incident Management
- Manage and oversee enterprise-wide incident response processes and on-call practices
- Drive root cause analysis programs and ensure systematic elimination of failure modes
- Implement sustainable on-call practices that maintain work-life balance while ensuring coverage
- Oversee capacity planning and resource optimization strategies across all services
- Establish metrics and reporting frameworks for reliability, performance, and operational health
Cross-Functional Partnership
- Collaborate with VP/Director-level peers in Engineering, Product, and Infrastructure
- Work with Security leadership to integrate reliability and security practices
- Partner with Finance on cost optimization initiatives and capacity planning budgets
- Engage with Customer Success and Support teams on reliability-impacting issues
Platform & Tooling Strategy
- Drive the simplification and reduction of observability, monitoring, and alerting platforms
- Establish automation standards and drive toil reduction initiatives
- Help improve CI/CD pipeline architecture and deployment practices
- Influence infrastructure-as-code and configuration management strategies
Organizational & Process Innovation
- Implement SRE best practices including error budgets, toil tracking, and reliability reviews
- Establish metrics-driven decision making and continuous improvement processes
- Drive adoption of chaos engineering and proactive reliability testing
- Create and maintain SRE documentation, runbooks, and knowledge sharing systems
- Develop and execute disaster recovery and business continuity plans
Required Skills
Leadership & Management Experience
- Bachelor's or Master's degree in Computer Science, Engineering, or equivalent experience
- 8+ years in engineering leadership roles, with 4+ years managing managers
- Proven track record of building and scaling engineering teams
- Experience with performance management, career development, and succession planning
- Strong executive presence and ability to influence without authority
- Experience driving organizational change and cultural transformation
Technical Expertise
- Experience with multiple cloud platforms (AWS, GCP, Azure) and hybrid environments
- Deep understanding of distributed systems, microservices architecture, and cloud platforms
- Hands-on experience with modern observability tools (Prometheus, Grafana, Datadog, etc.)
- Strong background in infrastructure automation, CI/CD, and infrastructure-as-code
- Expertise in capacity planning, performance optimization, and cost management
SRE & Operations Mastery
- Deep understanding of SRE principles, practices, and implementation at scale
- Experience establishing SLI/SLO frameworks and error budget management
- Proven track record of improving system reliability and reducing operational toil
- Experience with incident management, post-mortem processes, and reliability engineering
- Background in 24/7 operations and on-call management best practices
Business & Strategic Acumen
- Understanding of budget management, resource allocation, and ROI analysis
- Ability to communicate technical concepts to non-technical stakeholders and executives
- Experience with vendor management and technology partnership decisions
- Knowledge of compliance frameworks and regulatory requirements
Desired Skills
Advanced Technical Background
- Background in container orchestration (Kubernetes) and service mesh technologies
- Knowledge of database administration and data platform reliability
- Experience with security engineering and DevSecOps practices
Success Metrics
Reliability & Performance
- Achieve and maintain service availability targets (typically 99.9%+ uptime)
- Reduce mean time to detection (MTTD) and mean time to recovery (MTTR)
- Improve capacity planning accuracy and reduce over-provisioning costs
- Increase deployment frequency while maintaining reliability standards
Team & Organizational Development
- Build and retain a high-performing SRE organization with low attrition
- Establish clear career progression and achieve high employee satisfaction scores
- Develop internal talent and promote from within the SRE organization
- Create sustainable on-call practices with reasonable operational load
Operational Excellence
- Drive measurable reduction in operational toil and manual interventions
- Establish comprehensive observability and proactive alerting across all services
- Implement effective incident response with blameless post-mortem culture
- Achieve cost optimization targets while maintaining reliability standards
Five9 embraces diversity and is committed to building a team that represents a variety of backgrounds, perspectives, and skills. The more inclusive we are, the better we are. Five9 is an equal opportunity employer.
View our privacy policy, including our privacy notice to California residents, here: https://www.five9.com/pt-pt/legal.
Note: Five9 will never request that an applicant send money as a prerequisite for commencing employment with Five9.
Skills Required
Performance Optimization, Distributed Systems, Cost Management, Prometheus, Grafana, Datadog, Infrastructure Automation, Capacity Planning, GCP, Incident Management, Azure, Kubernetes, AWS