Cloud Reliability Engineering Lead


ConglomerateIT India
Hyderabad, India
1 day ago
Job description

About us

ConglomerateIT is a certified company and a pioneer in providing premium end-to-end Global Workforce Solutions and IT Services to diverse clients across various domains. Visit us at http://www.conglomerateit.com

Our mission is to establish global cross-cultural human connections that further the careers of our employees and strengthen the businesses of our clients. We are driven to use the power of a global network to connect businesses with the right people without bias. We provide Global Workforce Solutions with affability.

About the job

Job Title: Cloud Engineering Ops Lead (AWS + Application Support)

Location: Hyderabad (onsite)

Experience Level: 10+ years

Role Overview

We are seeking a Cloud Engineering Lead to drive reliability, performance, and operational excellence across complex AWS environments and production applications. This hybrid role combines the disciplines of Site Reliability Engineering, Cloud Operations, Application Support, and DevOps to ensure seamless, secure, and cost-efficient delivery of business-critical services.

The ideal candidate will bring deep AWS expertise, automation proficiency, and a strong focus on observability, incident management, and continuous improvement.

Key Responsibilities

1. AWS Cloud & Infrastructure Operations

Design, operate, and optimize AWS environments, including EC2, EKS, RDS, ALB/CloudFront, IAM/OIDC, and VPC/TGW/SGs.

Implement Infrastructure as Code (IaC) using Terraform and configuration management via Ansible.

Maintain system hygiene, patching, and OS-level administration across cloud workloads.

Drive cost optimization through tagging, right-sizing, and lifecycle management.

2. Site Reliability Engineering (SRE)

Establish and maintain SLIs, SLOs, and error budgets to ensure service reliability.

Lead incident management, post-mortems, and drive systemic improvements.

Develop and maintain automated runbooks and resiliency playbooks for predictable recovery.

Measure and continuously improve MTTR and change failure rates.

3. Application & Production Support

Own production readiness through deployment validation, rollback planning, and performance baselines.

Support application deployments and lead post-deployment smoke testing and validation.

Troubleshoot production issues end-to-end across infrastructure, middleware, and application layers.

Partner with development teams to ensure smooth CI/CD integrations and controlled releases.

4. Observability & Monitoring

Build and maintain comprehensive observability using CloudWatch, Prometheus, Grafana, Datadog, or equivalent.

Ensure actionable alerts, clear dashboards, and proper alert routing to responders.

Improve logging, tracing, and metrics coverage to drive proactive issue detection.

5. Backup, DR & Security

Define and validate backup, retention, and restore policies with measurable RPO/RTO objectives.

Implement cross-region replication and disaster recovery strategies.

Maintain strong security posture via IAM policies, OIDC integrations, and role-based access controls.

6. DevOps Enablement

Collaborate with DevOps teams to improve pipeline efficiency, deployment reliability, and release governance.

Automate operational workflows and reduce manual toil using Python, Bash, and IaC tools.

Integrate reliability metrics into CI/CD pipelines to ensure operational readiness before release.

7. Leadership & Mentoring

Lead Sev-1 and Sev-2 incident bridges with structured communication and post-resolution follow-ups.

Mentor engineers in SRE best practices, automation, and cloud operations maturity.

Foster a culture of reliability, transparency, and continuous improvement across teams.

Success Metrics

  • Reliability: Improved uptime, lower MTTR, and reduced change failure rates.
  • Visibility: Every service is monitored, logged, and observable.
  • Resilience: Regular restore tests pass; DR documentation validated quarterly.
  • Efficiency: Cloud spend optimized with >95% tagging compliance.
  • Automation: Reduced manual toil through IaC and scripting.

Required Skills & Experience

  • 10+ years of experience in Cloud Operations, SRE, or Production Engineering roles.
  • Proven expertise in AWS services, Terraform, Ansible, and Python/Bash scripting.
  • Strong experience in incident response, post-mortem analysis, and production support.
  • Hands-on experience with monitoring and observability tools (CloudWatch, Datadog, Prometheus, Grafana).
  • Deep understanding of cloud networking, IAM security, and backup/DR planning.
  • Experience collaborating with DevOps teams and driving automation at scale.
  • Excellent communication and leadership skills to guide teams during critical incidents.