Looking for a Manager, Site Reliability Engineering to help us scale our systems and ensure
the stability, reliability, performance, and rapid deployment of our platform. We build teams that
are inclusive, collaborative, and have a strong sense of ownership for the things they build. If you
have a passion and track record for solving problems, along with strong leadership skills, this is
a great fit for you.
As Manager, SRE, you will demonstrate both emerging and current technologies, methods, and
processes, contributing to the evolution of software deployment processes, enhancing security,
reducing risk, and improving the overall end-user experience. As part of the Technology R&D Team, you will play an integral part in advancing DevOps maturity and help build a new culture of quality and site reliability. You will continually improve our CI/CD tools, processes, and procedures. You will also be responsible for regular reporting to Senior Technology Leaders and for providing updates on organizational risk exposure and risk-related issues.
What You Will Be Doing:
Set the direction and strategy for your team, and help shape the overall SRE program for the company
Support growth by ensuring a robust, scalable, cloud-first infrastructure
Own site stability, performance, and capacity planning
Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations and launches
Foster a learning and ownership culture within the team and the larger organization
Ensure best engineering practices through automation, infrastructure as code, robust system monitoring, alerting, auto scaling, self-healing, etc.
Manage complex technical projects and a team of SREs
Recruit and develop staff; build a culture of excellence in site reliability and automation
Lead by example: roll up your sleeves by debugging and coding; participate in the on-call rotation
and occasional travel
Represent the technology perspective and priorities to leadership and other stakeholders by continuously communicating timeline, scope, risks, and technical road map
What You Will Need for this Position:
10+ years of hands-on technical leadership and people management experience
3+ years of demonstrable experience leading site reliability and performance in large-scale, high-traffic environments
Strong leadership, communication, and interpersonal skills geared to getting things done
Commitment to developing yourself and the talent in your charge, fostering and creating opportunity for the team
Architect-level understanding of one or more of the major public cloud services (AWS, GCP, or Azure), using them to effectively design secure and scalable services
Strong understanding of SRE concepts and the DevOps culture, with a focus on leveraging software engineering tools, methodologies, and concepts
In-depth understanding of automation and CI/CD processes, along with excellent reasoning and problem-solving skills
Experience with Unix/Linux environments, with a deep grasp of system internals
Experience working on large-scale distributed systems, including multi-tiered architecture
Strong knowledge of modern platforms like Fargate, Docker, Kubernetes, etc.
Experience working with monitoring tools (Datadog, New Relic, ELK stack, etc.) and database technologies (SQL Server, Postgres, and Couchbase preferred)
Demonstrated breadth of understanding and experience developing solutions based on multiple technologies, including networking, cloud, database, and scripting languages
Experience in prompt engineering, building AI Agents, or MCP is a plus.