Job Summary :
The Site Reliability Engineer specializing in Infrastructure as Code (IaC) and Terraform is responsible for designing, building, automating, and maintaining cloud infrastructure using modern DevOps and SRE practices.
The role ensures system reliability, scalability, high availability, and operational excellence across production environments.
The engineer will focus heavily on automation, monitoring, CI / CD, incident response, and performance engineering while working closely with developers and platform teams.
Key Responsibilities :
- Design, create, and maintain scalable cloud infrastructure using Terraform.
- Develop reusable Terraform modules, pipelines, and automation frameworks.
- Implement infrastructure provisioning, updates, and rollback workflows through version-controlled IaC.
- Ensure compliance with infrastructure standards, security policies, and cloud governance frameworks.
- Build and manage cloud infrastructure on AWS / Azure / GCP.
- Implement scalable architecture patterns (auto-scaling, load balancing, container orchestration).
- Optimize resource utilization and cost-efficiency.
- Manage VPCs, subnets, security groups, firewalls, IAM, and other cloud services.
- Ensure reliability, resiliency, scalability, and performance of production systems.
- Implement chaos engineering practices, fault injection, and resiliency tests.
- Conduct root cause analysis (RCA) and develop permanent fixes for system failures.
- Define and maintain SLOs, SLIs, SLAs, and error budgets.
- Build and enhance CI / CD pipelines using GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or similar.
- Automate testing, security checks, deployments, and environment provisioning.
- Implement GitOps workflows with tools like ArgoCD or Flux (optional).
- Deploy and manage containerized applications using Docker and Kubernetes.
- Manage clusters (EKS, AKS, GKE, or self-hosted Kubernetes).
- Implement a service mesh (Istio / Linkerd); prior experience is an advantage.
- Manage Helm charts, Kustomize, and Kubernetes controllers.
- Implement and maintain monitoring solutions (Prometheus, Grafana, Datadog, New Relic, CloudWatch, etc.).
- Set up centralized logging using ELK / EFK, Cloud Logging, or Splunk.
- Monitor system health, performance metrics, and application behavior.
- Build alerting strategies and auto-remediation systems.
- Implement security best practices across infrastructure and deployments.
- Manage secrets, encryption, access control, and network security.
- Use Terraform Cloud / Enterprise, Sentinel policies, and linting tools for compliance enforcement.
- Participate in security audits, pen tests, and cloud hardening initiatives.
- Participate in on-call rotations and respond to production incidents.
- Troubleshoot and resolve system outages, latency issues, and performance problems.
- Develop runbooks, playbooks, and post-incident reports.
- Automate repetitive operational tasks.
- Work collaboratively with developers, QA, product teams, and other SRE team members.
- Assist teams in adopting cloud-native, scalable, and automated practices.
- Maintain up-to-date system documentation, diagrams, and operational SOPs.
- Provide technical guidance and mentorship to junior engineers.
Required Skills & Competencies :
Technical Skills :
- Strong experience in Terraform and IaC best practices.
- Hands-on expertise with major cloud providers (AWS / Azure / GCP).
- Solid knowledge of Linux administration, networking, and distributed systems.
- Strong scripting skills (Python, Bash / shell).
- Excellent understanding of Kubernetes, Docker, and container orchestration.
- Strong CI / CD experience.
- Solid experience with monitoring tools (Grafana, Prometheus, Datadog, ELK).
- Knowledge of GitOps, configuration management (Ansible), or cloud-native patterns (preferred).
- Understanding of SRE concepts (SLIs, SLOs, error budgets, toil reduction).
(ref : hirist.tech)