Job Description :
As a Principal / Chief Site Reliability Engineer, you will play a critical role in designing, developing, and maintaining scalable and highly reliable systems.
You'll work closely with development teams to improve system reliability, monitor critical applications, and design fail-proof systems.
Responsibilities :
- Design and implement scalable, highly available infrastructure and automation solutions.
- Drive adoption of SRE principles, SLAs, SLOs, and error budgets across teams.
- Proactively identify, debug, and resolve complex system reliability issues.
- Build tooling for observability, alerting, and performance monitoring.
- Collaborate with developers and architects on cloud-native design and service resilience.
- Conduct failure analysis, system audits, and root cause investigations.
- Contribute to strategic infrastructure decisions and reliability roadmaps.
- Lead through influence by mentoring engineers and providing technical direction across teams.
- Work across multiple platforms and large-scale distributed systems.
Key Requirements :
Experience : 15+ years in technology, with at least 5 years in Site Reliability Engineering.
Development Background : Strong hands-on experience in C / C++, Java, Go, or Python.
Proven experience as a hands-on Individual Contributor (not a managerial role).
Proficiency in scripting, system programming, and multi-platform architecture.
Deep knowledge of :
a. Linux / Unix OS fundamentals.
b. Networking (DNS, TCP / IP, etc.).
c. Cloud platforms (preferably AWS).
d. Observability and Monitoring Tools.
e. CI / CD and Infrastructure as Code.
Strong exposure to SRE concepts : reliability, automation, on-call best practices, etc.
Experience with system design, performance tuning, and troubleshooting of large-scale systems.
(ref : hirist.tech)