About the Role
We are seeking a skilled Site Reliability Engineer (SRE) with 4–5 years of hands-on experience to join our engineering team. In this role, you will be responsible for building and maintaining reliable, scalable, and secure infrastructure to support our applications. You will leverage your expertise in automation, cloud platforms, and monitoring to ensure system availability, performance, and resilience.
Key Responsibilities
- Design, implement, and maintain Infrastructure as Code (IaC) using Terraform.
- Manage and optimize workloads deployed on Kubernetes (K8s) and containerized environments (Docker, Helm, etc.).
- Configure, administer, and troubleshoot Linux-based systems; write automation scripts using Bash/Shell scripting.
- Deploy, manage, and secure workloads in Azure Cloud environments, leveraging PaaS, IaaS, and managed services.
- Build and optimize CI/CD pipelines using GitHub Actions for automated deployments and testing.
- Implement, configure, and maintain robust monitoring and alerting systems using Grafana and Azure-native monitoring tools.
- Collaborate with developers and architects to improve application reliability, scalability, and performance.
- Proactively identify and resolve reliability and performance issues across distributed systems.
- Participate in on-call rotations to support production systems and respond to incidents.
Required Skills & Qualifications
- 4–5 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- Strong expertise in Terraform and Infrastructure as Code principles.
- Hands-on experience with Kubernetes, containerization, and orchestration tools.
- Proficiency in Linux system administration and Bash/Shell scripting.
- Solid knowledge of Azure Cloud services (networking, compute, storage, monitoring, security).
- Experience designing and maintaining CI/CD pipelines with GitHub Actions.
- Strong understanding of monitoring, alerting, and observability tools (Grafana, Prometheus, Azure Monitor, etc.).
- Familiarity with incident response, troubleshooting, and root cause analysis in distributed systems.
- Excellent problem-solving, collaboration, and communication skills.