About the Role
We are seeking a skilled Site Reliability Engineer (SRE) with 4–5 years of hands-on experience to join our engineering team. In this role, you will be responsible for building and maintaining reliable, scalable, and secure infrastructure to support our applications. You will leverage your expertise in automation, cloud platforms, and monitoring to ensure system availability, performance, and resilience.
Key Responsibilities
Design, implement, and maintain Infrastructure as Code (IaC) using Terraform.
Manage and optimize workloads deployed on Kubernetes (K8s) and containerized environments (Docker, Helm, etc.).
Configure, administer, and troubleshoot Linux-based systems; write automation scripts using Bash/Shell scripting.
Deploy, manage, and secure workloads in Azure Cloud environments, leveraging PaaS, IaaS, and managed services.
Build and optimize CI/CD pipelines using GitHub Actions for automated deployments and testing.
Implement, configure, and maintain robust monitoring and alerting systems using Grafana and Azure-native monitoring tools.
Collaborate with developers and architects to improve application reliability, scalability, and performance.
Proactively identify and resolve reliability and performance issues across distributed systems.
Participate in on-call rotations to support production systems and respond to incidents.
Required Skills & Qualifications
4–5 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
Strong expertise in Terraform and Infrastructure as Code principles.
Hands-on experience with Kubernetes, containerization, and orchestration tools.
Proficiency in Linux system administration and Bash/Shell scripting.
Solid knowledge of Azure Cloud services (networking, compute, storage, monitoring, security).
Experience designing and maintaining CI/CD pipelines with GitHub Actions.
Strong understanding of monitoring, alerting, and observability tools (Grafana, Prometheus, Azure Monitor, etc.).
Familiarity with incident response, troubleshooting, and root cause analysis in distributed systems.
Excellent problem-solving, collaboration, and communication skills.
Site Reliability Engineer • Delhi, India