About The Role :
We are looking for a highly experienced Staff Site Reliability Engineer (SRE) to drive the reliability, performance, and operational excellence of our core production systems.
This is a senior, hands-on role that requires deep expertise in large-scale distributed systems, complex incident management, and building world-class observability platforms.
Key Responsibilities :
Reliability Engineering :
- Define, measure, and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical platform services.
- Drive down toil by promoting self-service and automation.
Observability Platform :
Lead the design and implementation of our global observability stack, including metric collection (Prometheus / M3DB), distributed tracing (Jaeger / OpenTelemetry), and logging (Loki / Elasticsearch).
Incident Management :
Act as a technical leader during high-severity incidents, perform in-depth Root Cause Analysis (RCA), and implement long-term preventative measures.
Performance Tuning :
Conduct performance analysis and capacity planning for the entire platform, optimizing infrastructure and application bottlenecks.
Security & Compliance :
Partner with the security team to enforce security controls and best practices across the infrastructure layer.
Mentorship & Evangelism :
Mentor SRE and DevOps teams, and evangelize reliability best practices and engineering excellence across all product development teams.
Technical Skills (Must-Have) :
Distributed Systems :
Proven experience designing, running, and debugging large-scale distributed systems and microservices in a high-traffic environment.
Cloud & Kubernetes :
Expert proficiency in managing highly available Kubernetes clusters (e.g., K8s on GCP / AWS / Azure) and their underlying cloud resources.
Observability Stack :
Deep, hands-on experience with modern observability tools (such as Prometheus and Grafana).
Programming :
Expert in at least one modern programming language (Go / Python) for writing operators, automation tooling, and extending monitoring systems.
Infrastructure as Code (IaC) :
Advanced knowledge of Terraform for managing multi-cloud infrastructure.
Networking :
Advanced understanding of network concepts in a cloud / container environment (service mesh, network policies, load balancing).
Qualifications :
- Bachelor's or Master's degree in Computer Science or a related technical field.
- 8+ years of professional experience in SRE, DevOps, or Infrastructure Engineering roles.
- History of successfully implementing reliability improvements that result in measurable SLO adherence.
(ref : hirist.tech)