We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve reliability, fault tolerance, and operational efficiency of critical systems.
Key Responsibilities
Cloud Engineering (AWS):
Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
Automate infrastructure provisioning and configuration using Terraform / CloudFormation and AWS SDKs.
Manage containerized workloads (Docker, Kubernetes, EKS).
Python Development:
Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).
Develop custom monitoring / alerting integrations with APIs, SDKs, and third-party observability platforms.
Implement self-healing and resilience-focused automation scripts.
Chaos Engineering & Resiliency:
Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.
Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.
Partner with SRE and development teams to define SLIs, SLOs, and error budgets.
Document learnings from chaos tests and improve incident response & recovery playbooks.
DevOps & Observability:
Build and maintain CI / CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).
Integrate observability frameworks (Prometheus, Grafana, ELK / EFK, CloudWatch, Datadog) for monitoring and tracing.
Ensure proactive alerting and real-time visibility into system health.
Security & Compliance:
Apply AWS security best practices for IAM, networking, and data protection.
Ensure compliance with internal and external regulatory frameworks (SOC 2, ISO, GDPR, etc.).
Required Skills & Qualifications
6–10 years of experience in Cloud, DevOps, or SRE roles.
Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer / Solutions Architect).
Advanced Python development skills for automation and tooling (Boto3 a must).
Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).
Solid knowledge of IaC (Terraform / CloudFormation).
Proficiency in containers & orchestration (Docker, Kubernetes, EKS).
Strong background in monitoring, observability, and incident management.
Familiarity with the DevOps toolchain (CI / CD, Git, Jenkins, GitLab, CodePipeline).
Good understanding of resilient architectures, reliability principles, and disaster recovery.
Preferred Skills
Knowledge of Go / Shell scripting in addition to Python.
Experience with chaos testing in production-like environments.
Exposure to multi-cloud or hybrid-cloud environments.
Strong problem-solving and debugging skills.
What We Offer
Opportunity to lead cloud reliability & chaos engineering initiatives.
Culture focused on automation, resilience, and continuous improvement.
Growth opportunities through certifications, R&D projects, and leadership roles.
Site Reliability Engineer • Delhi, India