Key Responsibilities:
- Lead reliability engineering projects and drive them to successful completion.
- Ensure system stability, high availability, and optimal performance through proactive monitoring and troubleshooting.
- Design, build, and maintain reliable and scalable cloud-based infrastructure and services.
- Implement and manage observability tools (Grafana, Splunk, Dynatrace) for real-time monitoring, alerting, and logging.
- Automate manual and repetitive processes using Python, Bash, or PowerShell to enhance operational efficiency.
- Manage and optimize CI/CD pipelines and automation frameworks (Jenkins, GitLab CI, Ansible, Chef).
- Drive adoption of SRE principles (including SLIs, SLOs, SLAs, and error budgets) across teams.
- Provide on-call support and lead incident management, ensuring effective root cause analysis and postmortems.
- Collaborate with development and infrastructure teams to enhance platform observability and reduce operational toil.
- Engage in capacity planning, cost optimization, and scalability strategy discussions.
Skills Required:
Site Reliability Engineering, Cloud Infrastructure, Kubernetes, Docker, AWS, GCP