We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve reliability, fault tolerance, and operational efficiency of critical systems.
Key Responsibilities
Cloud Engineering (AWS):
- 6–10 years of experience in Cloud, DevOps, or SRE roles.
- Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer / Solutions Architect).
- Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
- Automate infrastructure provisioning and configuration using Terraform / CloudFormation and AWS SDKs.
- Manage containerized workloads (Docker, Kubernetes, EKS).
Python Development:
- Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).
- Develop custom monitoring / alerting integrations with APIs, SDKs, and third-party observability platforms.
- Implement self-healing and resilience-focused automation scripts.
Chaos Engineering & Resiliency:
- Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.
- Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.
- Partner with SRE and development teams to define SLIs, SLOs, and error budgets.
- Document learnings from chaos tests and improve incident response & recovery playbooks.
DevOps & Observability:
- Build and maintain CI / CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).
- Integrate observability frameworks (Prometheus, Grafana, ELK / EFK, CloudWatch, Datadog) for monitoring and tracing.
- Ensure proactive alerting and real-time visibility into system health.
Security & Compliance:
- Apply AWS security best practices for IAM, networking, and data protection.
- Ensure compliance with internal and external regulatory frameworks (SOC2, ISO, GDPR, etc.).
Required Skills & Qualifications
- Advanced Python development skills for automation and tooling (Boto3 a must).
- Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).
- Solid knowledge of IaC (Terraform / CloudFormation).
- Proficiency in containers & orchestration (Docker, Kubernetes, EKS).
- Strong background in monitoring, observability, and incident management.
- Familiarity with DevOps toolchain (CI / CD, Git, Jenkins, GitLab, CodePipeline).
- Good understanding of resilient architectures, reliability principles, and disaster recovery.
Preferred Skills
- Knowledge of Go / Shell scripting in addition to Python.
- Experience with chaos testing in production-like environments.
- Exposure to multi-cloud or hybrid-cloud environments.
- Strong problem-solving and debugging skills.
What We Offer
- Opportunity to lead cloud reliability & chaos engineering initiatives.
- Culture focused on automation, resilience, and continuous improvement.
- Growth opportunities through certifications, R&D projects, and leadership roles.