Lead Reliability Engineer

Elios Talent • Hyderabad, Republic Of India, IN
Job description

Senior Site Reliability Engineer

Key Highlights

🛠️ Build, scale, and optimize cloud-native infrastructure powering global, high-availability platforms

⚡ Drive automation-first engineering across AWS, Terraform, CI/CD, observability, and resilient systems

📊 Own reliability, uptime, system health, costs, and performance across mission-critical environments

🔐 Strengthen DevSecOps practices, improving security, delivery velocity, and operational excellence

🚨 Lead major incident response, troubleshoot complex issues, and uphold production stability at scale

Position Overview

We are seeking a Senior Site Reliability Engineer to drive reliability, automation, and performance for large-scale, cloud-based platforms. This role blends deep technical engineering, systems thinking, DevOps collaboration, and operational leadership.

You will design and implement scalable infrastructure, improve observability, enhance resiliency, manage incident operations, and champion modern DevSecOps practices. This role plays a critical part in supporting tens of thousands of daily users while ensuring platforms remain secure, fast, and highly available.

Key Responsibilities

Cloud Engineering

  • Architect, deploy, and optimize AWS environments using automation and Infrastructure-as-Code
  • Build tooling that increases predictability, stability, and delivery speed
  • Optimize systems for scale, reliability, cost, and performance
  • Maintain repeatable, traceable, and transparent infrastructure through Terraform and automation
  • Monitor cloud spend and usage, ensuring alignment with service-level objectives

Observability & Reliability

  • Own uptime, reliability, system security, performance metrics, and golden signals
  • Lead incident management and triage bridges during major events
  • Enhance telemetry systems (New Relic, CloudWatch, Datadog) for deep operational visibility
  • Use data-driven analysis to improve system stability and customer experience
  • Ensure architecture and deployment patterns meet SLAs and reliability goals

DevSecOps & Automation

  • Strengthen CI/CD pipelines, code-review practices, and engineering standards
  • Partner with Cybersecurity to address vulnerabilities through automation
  • Support secure, consistent, and scalable delivery workflows across engineering teams

Resiliency Engineering

  • Identify failure points, blast-radius risks, and architectural gaps
  • Run failure-injection / chaos testing to validate resiliency
  • Forecast traffic, plan for seasonal peaks, and scale systems for 2x+ load scenarios
  • Drive improvements to infrastructure and software to meet resiliency targets

Leadership & Collaboration

  • Mentor engineers across levels, promoting high-quality engineering practices
  • Collaborate daily with product, engineering, and security teams in a DevOps model
  • Document, uplift, and share knowledge through cross-team forums and best practices

Qualifications

  • Experience as a software engineer with strong debugging and deployment skills
  • Hands-on expertise with AWS and Terraform (required)
  • Experience with ECS; Kubernetes/EKS experience strongly preferred
  • Strong proficiency in Python, Golang, Bash, and automation frameworks
  • CI/CD experience with Jenkins, GitHub Enterprise, CircleCI, or similar
  • Ability to troubleshoot across web servers, app servers, OS, networks, storage, and databases
  • Experience running large-scale, high-availability production systems
  • Strong communication, root-cause analysis, and incident leadership skills
  • BS in Computer Science or equivalent industry experience

About Us

We build scalable, secure, and high-performing digital platforms that power global user experiences. By combining cloud engineering, automation, observability, and resilient systems design, we help organizations operate more reliably, innovate faster, and support long-term platform stability and growth.

Why Join Us

Join a forward-thinking engineering organization where reliability, automation, and performance are core values. You’ll work with a modern cloud stack, collaborate with exceptional engineers, and own meaningful technical impact across large-scale applications. This is an opportunity to shape infrastructure strategy, elevate engineering practices, and build systems that support millions with consistency and excellence.
