Job description
We are looking for an experienced Engineering Manager to lead our Site Reliability Engineering (SRE) team. The ideal candidate will have a strong background in SRE principles and practices, as well as experience managing and mentoring engineers. The SRE Manager will be responsible for the overall success of the SRE team, including ensuring that our systems are reliable, scalable, and secure. The team is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing BAU. The team also builds tools and infrastructure used by all development teams to assist in monitoring and troubleshooting.
As a Site Reliability Engineering Manager at Arcesium, you are expected to:
- Manage a team of SRE engineers / SRE Leads
- Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
- Work closely with engineering managers and development teams to ensure that platforms are designed with scale and operability in mind
- Help manage the team's infrastructure, e.g. container infrastructure using Docker and Kubernetes clusters, Kafka clusters, etc.
- Manage the team's AWS accounts and other infra provisioning.
- Provide day-to-day support of the dashboard, including responding to outages and triaging cases escalated by clients / internal teams
- Manage on-call rotations to provide 24-hour coverage
- Ensure systems are always DR ready
- Manage team projects with Agile Methodology (Scrum / Kanban).
- Review various processes from time to time and drive continual improvement.
- Mentor SREs with incident case-studies and technical workshops
- Mentor and coach engineers to be curious and effective at discovering and solving technical challenges
What you'll need:
- 10+ years of experience in DevOps / Site Reliability / Automation, with 4+ years of People / Team Management exposure
- Experience with a variety of tools that help manage, understand, and debug large, complex distributed systems
- Good knowledge of Unix systems, web technologies, databases, public cloud systems such as AWS, networking, and systems
- Reliability: Exposure to Chaos Engineering and various reliability practices, including disaster recovery, is good to have
- IT Service Management: Incident Management, Problem Management, Change Management
- Languages: Any of Python / Java / Node.js / Ruby
- Linux: System Administration + Shell Scripting
- Cloud Computing: Amazon Web Services
- Microservices & Containerization: Docker, Kubernetes
- Version Control: Git, GitHub, GitLab, etc.
- Configuration Management: Ansible / Chef / Puppet
- Agile: Scrum, Kanban
Skills Required
Unix, Shell Scripting, Automation