Role Summary :
We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong focus on observability. The ideal candidate will have 5-8 years of experience in implementing and managing monitoring, logging, and alerting systems. This role requires expertise in the Kubernetes stack, as well as a solid foundation in coding and Infrastructure as Code to ensure the reliability and health of our systems.
Key Responsibilities :
- Observability Implementation : Design and implement comprehensive observability solutions, including monitoring, logging, and alerting.
- Kubernetes Stack Management : Work extensively with the Kubernetes stack and related tools such as Prometheus, Loki, Grafana, and Alertmanager to ensure system performance and reliability.
- Coding & Automation : Apply proficiency in Python & Go to solve complex problems, automate tasks, and contribute to the development of tools and systems.
- Infrastructure & CI / CD : Utilize Infrastructure as Code and manage CI / CD pipelines to ensure continuous and reliable deployments.
- Troubleshooting : Apply strong troubleshooting and problem-solving skills to diagnose and resolve issues efficiently and proactively.
Required Skills :
- Observability : Expertise in all aspects of observability, including monitoring, logging, and alerting.
- Kubernetes Stack : Deep knowledge and hands-on experience with Prometheus, Loki, Grafana, and Alertmanager.
- Programming : Strong coding skills in Python & Go, sufficient for technical challenges.
- DevOps : Experience with CI / CD pipelines and Infrastructure as Code (IaC).
- Problem-Solving : Strong troubleshooting and problem-solving abilities.
- Cloud : Experience with AWS is mandatory.
Nice to Have Skills :
- Incident Management : Familiarity with PagerDuty.
- Integrations : Experience with the Zoom Developer Platform.
Education & Experience :
Education : A Bachelor's degree in Computer Science, Information Technology, or a related field is preferred.
Experience : 5-8 years of experience in a Site Reliability or DevOps engineering role, with a focus on observability.
(ref : hirist.tech)