About the Role
We are looking for a Cloud Site Reliability Engineer (SRE) with deep expertise in Amazon Web Services (AWS) to design, build, and maintain scalable, reliable, and secure cloud infrastructure. You will work closely with development, DevOps, and operations teams to ensure system uptime, performance, and cost efficiency.
Key Responsibilities
Reliability & Performance:
- Design and maintain highly available, fault-tolerant systems on AWS using services like EC2, ECS, EKS, Lambda, RDS, and CloudFront.
Automation & Infrastructure as Code:
- Implement and manage infrastructure with Terraform, CloudFormation, or CDK to ensure repeatability and scalability.
Monitoring & Incident Response:
- Develop observability solutions using CloudWatch, Prometheus/Grafana, Datadog, or New Relic.
- Define SLIs/SLOs/SLAs and manage on-call rotations for incident response and root-cause analysis.
CI/CD & Deployment:
- Work with tools like Jenkins, GitHub Actions, AWS CodePipeline, or ArgoCD to build automated pipelines.
Security & Compliance:
- Implement best practices for IAM, KMS, VPC security, and compliance (SOC 2, ISO 27001, HIPAA, etc.).
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience
- 3–7+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
- Strong hands-on experience with AWS services (EC2, S3, RDS, Lambda, CloudFormation, etc.)
- Proficiency in Terraform or other IaC tools
- Strong scripting skills in Python, Bash, or Go
- Experience with Kubernetes and container orchestration
- Familiarity with monitoring, logging, and alerting systems
- Understanding of networking, DNS, load balancing, and security principles