Job Description - Site Reliability Engineer
Experience - 8+ Years
Responsibilities :
- Ensure high availability, performance, and scalability of mission-critical systems and services.
- Lead the design and implementation of resilient and fault-tolerant infrastructure.
- Drive incident response, root cause analysis, and postmortem culture. Mentor others in incident practices.
- Write and maintain operational documentation, runbooks, and architecture diagrams.
- Drive and promote protocols on production readiness and operational excellence.
- Own and evolve infrastructure automation using Terraform or similar tools to minimize human intervention.
- Help automate infrastructure provisioning and other engineering processes by building automations on top of an engineering platform implemented with GitHub Actions.
- Build internal platforms, tools, and frameworks to improve developer productivity and service reliability.
- Work closely with software engineers, platform teams, and product managers to align on company goals.
- Coach and up-skill other engineering team members.
Skills and Qualifications :
- 8–12+ years in SRE, DevOps, or related infrastructure-focused roles.
- Understand large-scale complex systems from a reliability perspective.
- Design, implement, and maintain processes and tools.
- Passion for producing clean, standards-compliant, secure code.
- Bring a developer mindset and apply it to infrastructure.
- Strong experience with Linux / Unix systems.
- Deep experience with Kubernetes.
- Deep experience with tools like Terraform, Ansible, and Helm.
- Strong scripting skills for automating tasks in Python, Bash, or another scripting language.
- Experience with at least one relational and one non-relational database (e.g., PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch).
- Ability to identify time-consuming and error-prone manual tasks and then build / leverage tooling to automate them.
- Ability to identify root causes of instability in a large-scale distributed system across stacks.
- Experience leading high-severity incident responses and postmortems.
Nice to haves / Pluses :
- Experience with cloud-based solutions such as AWS, Google Cloud, or Microsoft Azure.
- Experience supporting scalable databases such as PostgreSQL or MongoDB in production.
- Understanding of cost