Job Title: Service Reliability Engineer
Location: Bangalore (Hybrid)
Experience: 9-12 Years
Mode of Working: Hybrid (Office-Based)
About The Role
We are looking for a highly skilled and experienced Lead Service Reliability Engineer (SRE) to join our growing team. In this role, you will be responsible for ensuring the reliability, performance, and scalability of our production systems. You'll play a key part in incident response, infrastructure automation, and driving operational excellence across the organization.
Key Responsibilities
- Handle and lead the response to production incidents with calm and clarity.
- Communicate effectively with internal teams and clients during outages.
- Draft detailed Root Cause Analysis (RCA) documents post-incident.
- Monitor and improve the performance, stability, and health of production systems.
- Proactively identify and resolve system issues by analyzing metrics and logs.
- Scale infrastructure to meet business objectives while adhering to SLA/SLO targets.
- Perform upgrades and maintenance on EKS clusters.
- Administer Kubernetes clusters and ensure optimal configuration and performance.
- Automate infrastructure using Terraform and Terragrunt (IaC).
- Integrate observability and security checks into CI/CD pipelines.
Required Skills & Qualifications
- Proven experience in managing production environments and incident handling.
- Hands-on experience with incident management tools (e.g., PagerDuty, ServiceNow).
- Strong expertise in observability tools (e.g., Datadog).
- Proficient in scripting/programming using Python or similar languages.
- Solid understanding and administration of Kubernetes.
- Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
- In-depth experience with AWS, including:
  - IAM (with cross-account role experience preferred)
  - EC2, VPC, S3
  - Networking (VPC, Transit Gateway, NACLs, Security Groups)
- Experience with EKS for cluster management and upgrades.
- Familiarity with CI/CD pipelines and DevOps best practices.
Preferred / Bonus Skills
- Exposure to infrastructure security and best practices: IAM least privilege, encryption, secrets management, etc.
- Experience working in Agile/Scrum environments.
What We Offer
- Opportunity to work on high-impact, production-critical systems.
- Collaborative and inclusive work culture.
- Competitive compensation and benefits.
- Learning and growth opportunities in cloud-native technologies and DevOps practices.
Join us and lead the charge in building scalable, reliable, and secure systems that power our mission.
(ref : hirist.tech)
Skills Required
ServiceNow, Terraform, Datadog, Kubernetes, Python, AWS