Senior Site Reliability Engineer (SRE) – Job Description
Key Responsibilities
SRE & Application Reliability
Implement and tune SLOs/SLIs, build reliability dashboards, and respond to incidents using Grafana IRM, JSM, and escalation workflows.
Monitor application performance and availability across Kubernetes clusters using Grafana, Prometheus, Loki, Mimir, and Tempo.
Participate in on-call rotation, postmortems, and continual improvement processes.
Application Support & Troubleshooting
Act as the primary escalation point for production issues, whether internal or client-facing.
Monitor logs, traces, and alerts to proactively identify and resolve incidents.
Debug issues across the stack: Kubernetes, Helm releases, application logs, API errors, and database bottlenecks.
Coordinate with development, QA, and client teams to ensure timely and effective resolution of issues.
DevOps & Infrastructure Automation
Implement GitOps workflows using FluxCD and ArgoCD to manage Kubernetes deployments.
Manage and maintain infrastructure as code using Terraform and Terragrunt, with Azure as the preferred cloud platform.
Automate CI/CD pipelines with GitHub Actions for Docker image builds, Helm-based deployments, and release tagging.
Post-QA & Release Validation
Work closely with QA engineers to validate release branches, tag images, and verify integration across services.
Test application functionality after each deployment (sanity and product functional tests).
Assist in defining performance benchmarks (e.g., pgbench for PostgreSQL clusters) and validating pre-production readiness.
Must-Have Qualifications
6–8 years of experience in DevOps, SRE, or Production Support roles.
Strong hands-on experience with Azure, Kubernetes (AKS preferred), and Helm/Kustomize.
Solid knowledge of GitHub Actions, GitOps (FluxCD/ArgoCD), and Terraform/Terragrunt.
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, Tempo, Mimir) and incident response tools.
Experience debugging microservices written in Node.js, Go, or similar.
Excellent troubleshooting and debugging skills across the stack.
Senior Site Reliability Engineer • Bengaluru, Karnataka, India