Description :
Responsibilities :
- Manage and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of collaboration and continuous learning.
- Lead the implementation of reliable, scalable, and fault-tolerant systems, including infrastructure, monitoring, and alerting.
- Manage incident response processes, including root cause analysis, post-mortem reviews, and proactive mitigation strategies to minimise system downtime and impact.
- Develop and maintain comprehensive monitoring systems to identify potential issues early, set appropriate alerting thresholds, and optimise system performance.
- Drive automation initiatives to streamline operational tasks, including deployments, scaling, and configuration management, utilising relevant tools and technologies.
- Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under load.
- Analyse system metrics to identify bottlenecks, implement performance improvements, and optimise resource utilisation.
- Work closely with development teams, product managers, and other stakeholders to ensure alignment on reliability goals and smooth integration of new features.
- Develop and implement the SRE roadmap, including technology adoption, standards, and best practices to maintain a high level of system reliability.
Requirements :
- Strong proficiency in system administration, cloud computing (AWS, Azure), networking, distributed systems, and containerisation technologies (Docker, Kubernetes).
- Expertise in scripting languages (Python, Bash) and the ability to develop automation tools.
- Good to have: a basic understanding of Java.
- Deep understanding of monitoring systems (Prometheus, Grafana), alerting configurations, and log analysis.
- Proven experience in managing critical incidents, performing root cause analysis, and coordinating response efforts.
- Excellent communication skills to convey technical concepts to both technical and non-technical audiences, and the ability to lead and motivate a team.
- Strong analytical and troubleshooting skills to identify and resolve complex technical issues.
(ref : hirist.tech)