Site Reliability Engineer (SRE)
We are looking for an experienced Site Reliability Engineer (SRE) to ensure the reliability, scalability, and security of our cloud-based data infrastructure.
The ideal candidate will have strong expertise in AWS infrastructure, Kubernetes administration, infrastructure automation, and monitoring tools.
You will collaborate closely with development teams to deploy and maintain data applications in a fast-paced environment.
Key Responsibilities:
- Design, implement, and maintain reliable, scalable, and secure AWS cloud infrastructure focused on supporting data products.
- Automate infrastructure provisioning and management using tools such as Pulumi, Terraform, and policy-as-code frameworks.
- Administer and optimize Kubernetes clusters (EKS), ensuring high availability and performance of containerized applications.
- Monitor system health and performance; utilize observability tools to proactively identify and resolve issues.
- Implement security best practices, compliance controls, and risk mitigation strategies for cloud infrastructure.
- Collaborate with software engineering teams to support continuous deployment and smooth operation of data pipelines and applications.
- Troubleshoot infrastructure and application issues; participate in incident response and root cause analysis.
- Optimize data pipelines and infrastructure for cost-efficiency and operational excellence.
- Operate effectively in a dynamic, fast-paced development environment with evolving requirements and tight deadlines.
Qualifications:
- 7+ years of hands-on experience in Site Reliability Engineering (SRE) or similar roles.
- Strong experience with AWS services, including EC2, S3, IAM, CloudFormation, Lambda, and related data infrastructure components.
- Expertise in Kubernetes (EKS) administration and orchestration of containerized workloads.
- Proficient in infrastructure-as-code tools such as Pulumi and Terraform for automated provisioning and management.
- Experience with monitoring and observability tools like Prometheus, Grafana, ELK Stack, Datadog, or similar.
- Solid understanding of security principles and best practices for cloud infrastructure.
- Strong scripting skills (Python, Bash, or similar) for automation and troubleshooting tasks.
- Excellent communication, collaboration, and problem-solving skills.
- Self-motivated, proactive, and able to work independently as well as in cross-functional teams.
Preferred Skills:
- Experience with CI/CD pipelines and DevOps practices.
- Familiarity with policy-as-code frameworks like OPA (Open Policy Agent) or HashiCorp Sentinel.
- Knowledge of data engineering concepts and pipeline optimization.
- Exposure to container networking, service mesh, and advanced Kubernetes features.
(ref: hirist.tech)