Job Title : SRE Head
Experience Level : ~10 years
Role Type : Engineering / Reliability
Role Overview :
The SRE Head is responsible for leading and scaling the Site Reliability Engineering (SRE) function across the organization. This role defines the reliability strategy, standards, and practices to ensure high availability, performance, and resilience of critical systems. The SRE Head partners with engineering, infrastructure, and operations teams to embed reliability, observability, and continuous improvement across all services.
Key Responsibilities :
Lead and define the SRE strategy, operating model, and best practices across the organization.
Establish and maintain SLIs, SLOs, and SLAs to measure and ensure service reliability and performance.
Oversee incident management, post-incident reviews, and root cause analysis for major outages.
Drive resilience engineering, disaster recovery, and chaos engineering initiatives.
Collaborate with development, infrastructure, and operations teams to improve reliability and automation.
Lead efforts to improve observability, including metrics, logging, and tracing frameworks.
Foster a culture of proactive reliability, continuous learning, and blameless postmortems.
Mentor and guide SRE leads and engineers, building high-performing reliability teams.
Track and communicate reliability trends, key metrics, and risk areas to leadership.
Evaluate and adopt emerging tools and practices to enhance platform reliability and scalability.
Required Qualifications & Experience :
10+ years of experience in SRE, reliability engineering, or production operations in large-scale environments.
Proven expertise in availability management, incident response, and service continuity.
Strong technical understanding of cloud platforms (GCP/AWS/Azure), Kubernetes, CI/CD, and automation.
Proficiency in observability tools (e.g., Prometheus, Grafana, Dynatrace, Datadog, ELK, OpenTelemetry).
Experience implementing SLIs/SLOs, error budgets, and capacity planning frameworks.
Strong leadership, strategic thinking, and cross-functional collaboration skills.
Excellent communication, mentoring, and culture-building abilities.
Desirable Skills :
Experience in building and scaling SRE organizations or CoEs.
Exposure to performance engineering, cost optimization, and AIOps practices.
Deep understanding of network reliability, security resiliency, and compliance-driven uptime goals.
Certification in reliability or cloud architecture (e.g., Google SRE, GCP Professional Architect).
SRE • India