Cloud Engineering Ops Lead (AWS + Application Support)

ConglomerateIT India, Nagpur, IN
Job description

About us

ConglomerateIT is a certified company and a pioneer in providing premium end-to-end Global Workforce Solutions and IT Services to diverse clients across various domains. Visit us at http://www.conglomerateit.com

Our mission is to establish global, cross-cultural human connections that further the careers of our employees and strengthen the businesses of our clients. We are driven to use the power of a global network to connect businesses with the right people, without bias. We provide Global Workforce Solutions with affability.

About job

Job Title: Cloud Engineering Ops Lead (AWS + Application Support)

Location: Hyderabad (onsite)

Experience Level: 10+ years

Role Overview

We are seeking a Cloud Engineering Ops Lead to drive reliability, performance, and operational excellence across complex AWS environments and production applications. This hybrid role combines the disciplines of Site Reliability Engineering, Cloud Operations, Application Support, and DevOps to ensure seamless, secure, and cost-efficient delivery of business-critical services.

The ideal candidate will bring deep AWS expertise, automation proficiency, and a strong focus on observability, incident management, and continuous improvement.

Key Responsibilities

1. AWS Cloud & Infrastructure Operations

Design, operate, and optimize AWS environments, including EC2, EKS, RDS, ALB/CloudFront, IAM/OIDC, and VPC/TGW/SGs.

Implement Infrastructure as Code (IaC) using Terraform and configuration management via Ansible.

Maintain system hygiene, patching, and OS-level administration across cloud workloads.

Drive cost optimization through tagging, right-sizing, and lifecycle management.

2. Site Reliability Engineering (SRE)

Establish and maintain SLIs, SLOs, and error budgets to ensure service reliability.

Lead incident management and post-mortems, and drive systemic improvements.

Develop and maintain automated runbooks and resiliency playbooks for predictable recovery.

Measure and continuously improve MTTR and change failure rates.

3. Application & Production Support

Own production readiness through deployment validation, rollback planning, and performance baselines.

Support application deployments and lead post-deployment smoke testing and validation.

Troubleshoot production issues end to end across infrastructure, middleware, and application layers.

Partner with development teams to ensure smooth CI/CD integrations and controlled releases.

4. Observability & Monitoring

Build and maintain comprehensive observability using CloudWatch, Prometheus, Grafana, Datadog, or equivalent.

Ensure actionable alerts, clear dashboards, and proper alert routing to responders.

Improve logging, tracing, and metrics coverage to drive proactive issue detection.

5. Backup, DR & Security

Define and validate backup, retention, and restore policies with measurable RPO/RTO objectives.

Implement cross-region replication and disaster recovery strategies.

Maintain a strong security posture via IAM policies, OIDC integrations, and role-based access controls.

6. DevOps Enablement

Collaborate with DevOps teams to improve pipeline efficiency, deployment reliability, and release governance.

Automate operational workflows and reduce manual toil using Python, Bash, and IaC tools.

Integrate reliability metrics into CI/CD pipelines to ensure operational readiness before release.

7. Leadership & Mentoring

Lead Sev-1/Sev-2 incident bridges with structured communication and post-resolution follow-ups.

Mentor engineers in SRE best practices, automation, and cloud operations maturity.

Foster a culture of reliability, transparency, and continuous improvement across teams.

Success Metrics

  • Reliability: Improved uptime, lower MTTR, and reduced change failure rates.
  • Visibility: Every service is monitored, logged, and observable.
  • Resilience: Regular restore tests pass; DR documentation validated quarterly.
  • Efficiency: Cloud spend optimized with >95% tagging compliance.
  • Automation: Reduced manual toil through IaC and scripting.

Required Skills & Experience

  • 10+ years of experience in Cloud Operations, SRE, or Production Engineering roles.
  • Proven expertise in AWS services, Terraform, Ansible, and Python/Bash scripting.
  • Strong experience in incident response, post-mortem analysis, and production support.
  • Hands-on with monitoring and observability tools (CloudWatch, Datadog, Prometheus, Grafana).
  • Deep understanding of cloud networking, IAM security, and backup/DR planning.
  • Experience collaborating with DevOps teams and driving automation at scale.
  • Excellent communication and leadership skills to guide teams during critical incidents.