Job Title : SRE Lead
Experience Level : ~10 years
Role Type : Engineering / Reliability
Role Overview :
The SRE Lead is responsible for leading site reliability initiatives across assigned product or platform areas, ensuring systems are scalable, reliable, and performant. This role defines and manages reliability goals, drives operational excellence, and partners closely with development, infrastructure, and QA teams to build and sustain highly available services.
Key Responsibilities :
- Lead SRE practices and initiatives across specific product or platform areas.
- Define and manage SLOs, SLIs, and error budgets to maintain system reliability targets.
- Implement and enhance instrumentation, monitoring, alerting, and automation frameworks.
- Collaborate with development, QA, and infrastructure teams to embed reliability and performance in design.
- Participate in incident response, perform root cause analysis, and drive continuous improvement actions.
- Guide and mentor SRE engineers in best practices, operational excellence, and incident handling.
- Contribute to building observability platforms and operational dashboards for proactive detection.
- Drive initiatives around capacity planning, scalability, and performance optimization.
- Promote a blameless culture and foster learning from post-incident reviews.
Required Qualifications & Experience :
- 10+ years of experience in site reliability engineering, operations, or production support roles.
- Strong experience with monitoring and observability tools (e.g., Prometheus, Grafana, Dynatrace, Datadog, ELK).
- Solid understanding of distributed systems, scaling patterns, and performance tuning.
- Hands-on experience with cloud platforms (GCP/AWS/Azure), Kubernetes, CI/CD, and automation tools.
- Expertise in incident management, postmortems, and resilience engineering.
- Strong mentoring, communication, and cross-team collaboration skills.
Desirable Skills :
- Familiarity with infrastructure-as-code (Terraform, Ansible) and service mesh frameworks.
- Experience in chaos engineering, DR planning, and cost optimization for reliability.
- Understanding of DevOps principles and AIOps concepts.
- Certification in cloud architecture or SRE practices (e.g., Google SRE, GCP Professional DevOps Engineer).