Site Reliability Engineer

Xebia • Lucknow, IN

21 days ago

Job description

We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve the reliability, fault tolerance, and operational efficiency of critical systems.

Key Responsibilities

Cloud Engineering (AWS):

Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
Automate infrastructure provisioning and configuration using Terraform/CloudFormation and AWS SDKs.
Manage containerized workloads (Docker, Kubernetes, EKS).

Python Development:

Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).

Develop custom monitoring/alerting integrations with APIs, SDKs, and third-party observability platforms.

Implement self-healing and resilience-focused automation scripts.

Chaos Engineering & Resiliency:

Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.

Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.

Partner with SRE and development teams to define SLIs, SLOs, and error budgets.

Document learnings from chaos tests and improve incident response & recovery playbooks.

DevOps & Observability:

Build and maintain CI/CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).

Integrate observability frameworks (Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog) for monitoring and tracing.

Ensure proactive alerting and real-time visibility into system health.

Security & Compliance:

Apply AWS security best practices for IAM, networking, and data protection.

Ensure compliance with internal and external regulatory frameworks (SOC2, ISO, GDPR, etc.).

Required Skills & Qualifications

6–10 years of experience in Cloud, DevOps, or SRE roles.

Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer/Solutions Architect).

Advanced Python development skills for automation and tooling (Boto3 is a must).

Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).

Solid knowledge of IaC (Terraform/CloudFormation).

Proficiency in containers & orchestration (Docker, Kubernetes, EKS).

Strong background in monitoring, observability, and incident management.

Familiarity with the DevOps toolchain (CI/CD, Git, Jenkins, GitLab, CodePipeline).

Good understanding of resilient architectures, reliability principles, and disaster recovery.

Preferred Skills

Knowledge of Go/Shell scripting in addition to Python.

Experience with chaos testing in production-like environments.

Exposure to multi-cloud or hybrid-cloud environments.

Strong problem-solving and debugging skills.

What We Offer

Opportunity to lead cloud reliability & chaos engineering initiatives.

Culture focused on automation, resilience, and continuous improvement.

Growth opportunities through certifications, R&D projects, and leadership roles.
