Principal AWS Reliability Engineer

graph8, Hyderabad, India
1 day ago
Job description

Your mission: build and maintain a secure, automated, and observable AWS foundation so engineers can ship faster, safer, and cheaper. You’ll own deployment velocity, system uptime, and cloud cost sanity across our ECS-based microservices.

What You’ll Own

1. Platform Reliability

  • Design and maintain ECS clusters (Fargate/EC2) for multi-service workloads.
  • Implement autoscaling, health checks, and blue/green rollouts for zero-downtime deployments.
  • Build observability into everything — logs, metrics, traces — to shorten MTTR.

2. Delivery Automation

  • Architect and maintain CI/CD pipelines using GitHub Actions and CodePipeline/CodeBuild.
  • Enforce testing, security scanning, and deployment gates as part of every release.
  • Move from semi-manual deploys to fully automated pipelines across environments.

3. Network & Security

  • Manage VPC architectures (subnets, routing, gateways, VPN, endpoints).
  • Manage Route 53 for internal/external DNS, SSL/TLS, health checks, and routing policies.
  • Maintain multi-account setup with IAM least privilege, KMS encryption, and security baselines.

4. Infrastructure as Code

  • Define all infrastructure in Terraform/CDK; no console drift.
  • Use IaC reviews and environments for repeatable, compliant infrastructure.

5. Data Layer Operations

  • Operate and optimize ClickHouse and PostgreSQL clusters — backups, replication, partitioning, and tuning.
  • Ensure RTO / RPO objectives are met and documented.

6. Monitoring & Debugging

  • Aggregate logs (CloudWatch, FireLens, OpenTelemetry).
  • Build dashboards and alerts that highlight anomalies, not noise.
  • Lead root-cause investigations across network, container, and app layers.

Core Tech Stack

  • AWS: ECS (Fargate/EC2), EC2, S3, VPC, Route 53, CloudWatch, CodePipeline, CodeBuild
  • CI/CD: GitHub Actions, Docker, Terraform/CDK
  • Databases: ClickHouse, PostgreSQL
  • Languages (a plus): FastAPI (Python), Node.js
  • Networking: DNS, VPN, load balancers, PrivateLink, peering, NAT, IGW
  • Security: multi-account strategy, IAM roles/policies, KMS, AWS Config, GuardDuty

Requirements

  • 5+ years running production workloads on AWS.
  • Deep knowledge of ECS, CodePipeline, EC2/VPC, S3, and Docker.
  • Proven track record of shipping secure, automated deployments.
  • Strong understanding of networking and DNS fundamentals.
  • Experience managing databases in production.
  • Strong debugging and observability mindset.
  • Clear written communication and operational discipline.

Nice to Have

  • Familiarity with FastAPI or Node.js applications to optimize deployment flows.
  • Hands-on experience with cost optimization and cross-account automation (Organizations, Control Tower).
  • Experience setting up VPNs, bastion hosts, or SSO integration.

What Success Looks Like

  • ✅ All ECS services deployed via automated pipelines.
  • ✅ CloudWatch dashboards and alerts in place for core systems.
  • ✅ Verified ClickHouse and PostgreSQL backups / restores.
  • ✅ Documented multi-account / VPC network topology.
  • ✅ No manual deploys, no console changes.

Why This Role Matters

This role defines the foundation for everything we build. The more you automate, the faster teams deliver.

You’ll directly impact uptime, developer productivity, and cloud spend: three metrics that define operational excellence.
