Description
We are seeking an experienced Site Reliability Engineering (SRE) Lead to join our team in India. The ideal candidate will have a strong background in ensuring the reliability, scalability, and performance of our services while leading a team of SREs. This role requires a mix of technical expertise, leadership skills, and a passion for operational excellence.
Responsibilities
- Lead and mentor a team of Site Reliability Engineers (SREs) to ensure high availability and reliability of services.
- Design and implement monitoring, alerting, and incident response strategies.
- Collaborate with development teams to integrate SRE practices into the software development lifecycle.
- Automate manual processes to improve efficiency and reduce human error.
- Optimize system performance and manage capacity planning.
- Conduct root cause analysis for incidents and implement corrective actions to prevent future occurrences.
- Develop and maintain documentation for systems, processes, and procedures.
Skills and Qualifications
- 7-15 years of experience in Site Reliability Engineering or a related field.
- Strong knowledge of cloud platforms (AWS, GCP, Azure).
- Proficiency in scripting and programming languages such as Python, Go, or Ruby.
- Experience with containerization technologies (Docker, Kubernetes).
- Familiarity with CI/CD pipelines and DevOps practices.
- Understanding of networking concepts, load balancing, and distributed systems.
- Experience with monitoring tools (Prometheus, Grafana, Nagios).
- Strong problem-solving skills and the ability to work under pressure.
- Excellent communication and collaboration skills.
Skills Required
Kubernetes, Prometheus, Grafana, Terraform, Cloud Computing, Scripting, Monitoring Tools, Incident Management, Networking