Senior Site Reliability Engineer (GCP | Terraform | Ansible | SRE | On-Call)
We are looking for a high-impact Site Reliability Engineer (SRE) who will play a key role in ensuring the reliability, availability, and scalability of our production systems on Google Cloud Platform (GCP).
If you thrive in fast-paced environments, excel in incident management, and love building automated, scalable infrastructure, this role is for you.
Responsibilities
Production Reliability & On-Call Excellence
Act as a primary responder in a 24×7 rotational on-call schedule.
Rapidly identify, mitigate, and resolve high-severity production incidents impacting GCP services.
Conduct detailed Root Cause Analysis (RCA) and implement long-term corrective actions.
Infrastructure-as-Code (IaC)
Design, build, and maintain large-scale, multi-environment infrastructure using Terraform.
Develop reusable modules, follow best practices, and maintain version-controlled infrastructure deployments.
Configuration Management
Build and optimize Ansible playbooks and roles for configuration consistency, patching, and environment provisioning.
Automation & Tooling
Develop automation using Python, Go, or Bash to eliminate operational toil and accelerate engineering productivity.
Drive an automation-first culture across the SRE team.
Monitoring, Observability & Tooling
Enhance monitoring, logging, and alerting using tools like Prometheus, Grafana, Stackdriver, or similar.
Improve observability for proactive detection of service health degradation.
Containers & Orchestration
Manage and troubleshoot Kubernetes (GKE) clusters for deployment, scaling, and reliability of containerized applications.
SRE Best Practices
Define and measure SLIs/SLOs, engineer reliability, and reduce toil through automation.
Collaborate closely with DevOps, Cloud, and Engineering teams for continuous improvement.
Requirements
Must Have
3+ years of hands-on experience with GCP, including GKE, GCE, VPC networking, IAM, load balancers, security, and networking fundamentals.
Advanced expertise in Terraform for production-grade infrastructure deployments.
Strong Ansible experience for configuration management.
Proven experience in on-call rotations, incident response, and handling critical production issues.
Proficiency in Python, Go, or Bash for automation.
Strong understanding of SRE principles: SLIs/SLOs, error budgets, incident management, RCA.
Experience with Kubernetes, containerization, and troubleshooting distributed systems.
Nice to Have
Exposure to service mesh (Istio / Linkerd).
Experience with CI/CD pipelines (Jenkins, GitLab CI, Cloud Build).
Networking and security certifications (GCP Associate Cloud Engineer / Professional Cloud DevOps Engineer).
What We Offer
Opportunity to work on high-scale, mission-critical systems.
A culture of ownership, innovation, and automation.
Competitive compensation + on-call benefits.
Growth opportunities in SRE, Cloud, and Platform Engineering tracks.
How to Apply
Share your updated resume at:
Site Reliability Engineer • Bengaluru, Karnataka, India