Lead Site Reliability Engineer
Location : Bangalore
Experience : 10 - 15 Years (with 5+ years in Site Reliability Engineering)
Job Summary :
We are seeking a highly experienced and technically proficient Lead Site Reliability Engineer (SRE) to join our team in Bangalore. With 10-15 years of overall experience and a minimum of 5 years dedicated to Site Reliability Engineering principles, you will be instrumental in ensuring the reliability, scalability, and performance of our critical systems. This role demands a deep understanding and hands-on application of SRE concepts including SLOs, SLIs, SLAs, error budgets, aggressive toil elimination through automation, robust observability, and effective emergency response. You will lead by example, driving operational excellence and fostering a culture of reliability across engineering teams.
Key Responsibilities :
- Lead the adoption and implementation of Site Reliability Engineering principles across critical services and infrastructure.
- Define, implement, and continuously monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for all production systems.
- Manage and utilize error budgets effectively to balance reliability with innovation and feature velocity.
- Drive significant toil elimination through automation, designing and implementing automated solutions for routine operational tasks, deployments, and remediation.
- Design, implement, and manage comprehensive observability and monitoring solutions, ensuring deep insights into system health, performance, and user experience. This includes logging, metrics, tracing, and alerting.
- Lead emergency response efforts, including effective incident triage, root cause analysis through blameless post-mortems, and driving actionable outcomes from retrospectives.
- Collaborate with development and operations teams to design and build highly scalable, resilient, and fault-tolerant systems.
- Provide technical leadership and mentorship to other engineers, advocating for SRE best practices and fostering a strong reliability culture.
- Participate in capacity planning and performance engineering to ensure systems can handle current and future load.
- Drive continuous improvement in system reliability, performance, and operational efficiency through systematic approaches.
- Design and implement robust disaster recovery and business continuity plans.
Required Skills and Qualifications :
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- 10 - 15 years of overall experience in software development, operations, or infrastructure roles.
- Minimum of 5 years of relevant, hands-on experience in Site Reliability Engineering principles and practices.
- Strong understanding and practical application of SLOs, SLIs, SLAs, and error budgets.
- Proven expertise in automation, with hands-on experience using scripting languages like Python, Go, or Bash.
- Extensive experience with observability and monitoring tools and platforms (e.g., Prometheus, Grafana, ELK Stack, Datadog, Splunk, New Relic).
- In-depth knowledge of incident management processes, including leading triage, conducting blameless post-mortems, and driving effective retrospectives.
- Experience with cloud platforms (AWS, Azure, or Google Cloud Platform).
- Proficient in containerization technologies (e.g., Docker) and orchestration platforms (e.g., Kubernetes).
- Strong understanding of CI/CD pipelines and their integration with SRE practices.
- Solid knowledge of distributed systems, microservices architectures, and their operational challenges.
- Excellent communication, interpersonal, and leadership skills, with the ability to influence and collaborate across various teams.
Preferred Skills :
- Experience with specific SRE-focused tools and platforms.
- Certifications in cloud platforms or SRE-related domains.
- Familiarity with chaos engineering principles and tools.
- Experience in performance testing and load testing.
- Contributions to open-source SRE tools or communities.
(ref : hirist.tech)