Site Reliability Engineering Lead

Infinite Computer Solutions, Bengaluru, Republic of India, IN
7 days ago
Job description

We are looking for a Site Reliability Engineering (SRE) / DevOps Manager.

Location : Bangalore / Hyderabad / Chennai / Noida / Pune / Visakhapatnam / Gurgaon

Shift timing : regular

Joining time : Immediate to 30 days

Interested candidates, please share your profile and the details below with:

Email ID : shanmukh.varma@infinite.com

Total experience :

Relevant Experience :

Current CTC :

Expected CTC :

Notice Period :

If Serving Notice Period, Last working day :

Job Summary

We are seeking an experienced Site Reliability Engineering (SRE) Manager to lead and evolve our cloud infrastructure, reliability practices, and automation strategy. This role blends hands-on technical leadership with strategic oversight to ensure scalable, secure, and reliable systems across AWS-based environments.

As an SRE Manager, you will guide a team of DevOps and SRE engineers to design, build, and operate cloud-native platforms leveraging Kubernetes (EKS), Terraform, and AWS DevOps tools. You will drive operational excellence through observability, automation, and AIOps, enhancing reliability, performance, and cost efficiency.

You will collaborate closely with development, product, and security teams to define SLOs, manage error budgets, and continuously improve infrastructure resilience and developer productivity.

Key Responsibilities

Leadership & Strategy

  • Lead, mentor, and grow a global team of Site Reliability and DevOps Engineers.
  • Define and drive the reliability roadmap, SLOs, and error budgets across services.
  • Establish best practices for infrastructure automation, observability, and incident response.
  • Partner with engineering leadership to shape long-term cloud, Kubernetes, and AIOps strategies.

Infrastructure & Automation

  • Design, implement, and manage AWS cloud infrastructure using Terraform (advanced modules, remote state management, custom providers).
  • Build and optimize CI/CD pipelines using AWS CodePipeline, CodeBuild, CodeDeploy, and CodeCommit.
  • Manage EKS clusters with a focus on scalability, reliability, and cost efficiency, leveraging Helm, ingress controllers, and a service mesh (e.g., Istio).
  • Implement robust security and compliance practices (IAM policies, network segmentation, secrets management).
  • Automate environment provisioning for dev, staging, and production using Infrastructure as Code (IaC).

Monitoring, Observability & Reliability

  • Lead observability initiatives using Prometheus, Grafana, CloudWatch, and OpenSearch/ELK.
  • Improve system visibility and response times by enhancing monitoring, tracing, and alerting mechanisms.
  • Drive proactive incident management and root cause analysis (RCA) to prevent recurring issues.
  • Apply chaos engineering principles and reliability testing to ensure resilience under load.

AIOps & Advanced Operations

  • Integrate AIOps tools to proactively detect, diagnose, and remediate operational issues.
  • Design and manage scalable deployment strategies for AI/LLM workloads (e.g., Llama, Claude, Cohere).
  • Monitor model performance and reliability across hybrid Kubernetes and managed AI environments.
  • Stay current with MLOps and Generative AI infrastructure trends, applying them to production workloads.
  • Manage 24/7 operations using appropriate alerting tools and a follow-the-sun model.

Cost Optimization & Governance

  • Analyze and optimize cloud costs through instance right-sizing, auto-scaling, and spot usage.
  • Implement cost-aware architecture decisions and monitor monthly spend for alignment with budgets.
  • Establish cloud governance frameworks to enhance cost visibility and accountability across teams.

Collaboration & Process

  • Partner with developers to streamline deployment workflows and improve developer experience.
  • Maintain high-quality documentation, runbooks, and postmortem reviews.
  • Foster a culture of reliability, automation, and continuous improvement across teams.