Senior Site Reliability Engineer (GCP | Terraform | Ansible | SRE | On-Call)
We are looking for a high-impact Site Reliability Engineer (SRE) who will play a key role in ensuring the reliability, availability, and scalability of our production systems on Google Cloud Platform (GCP).
If you thrive in fast-paced environments, excel in incident management, and love building automated, scalable infrastructure—this role is for you.
🔧 Responsibilities
Production Reliability & On-Call Excellence
- Act as a primary responder in a 24×7 rotational on-call schedule.
- Rapidly identify, mitigate, and resolve high-severity production incidents impacting GCP services.
- Conduct detailed Root Cause Analysis (RCA) and implement long-term corrective actions.
Infrastructure-as-Code (IaC)
- Design, build, and maintain large-scale, multi-environment infrastructure using Terraform.
- Develop reusable modules, follow best practices, and maintain version-controlled infrastructure deployments.
Configuration Management
- Build and optimize Ansible playbooks and roles for configuration consistency, patching, and environment provisioning.
Automation & Tooling
- Develop automation using Python, Go, or Bash to eliminate operational toil and accelerate engineering productivity.
- Drive automation-first culture across the SRE team.
Monitoring, Observability & Tooling
- Enhance monitoring, logging, and alerting using tools like Prometheus, Grafana, Stackdriver, or similar.
- Improve observability for proactive detection of service health degradation.
Containers & Orchestration
- Manage and troubleshoot Kubernetes (GKE) clusters for deployment, scaling, and reliability of containerized applications.
SRE Best Practices
- Define and measure SLIs/SLOs, engineer reliability, and reduce toil through automation.
- Collaborate closely with DevOps, Cloud, and Engineering teams for continuous improvement.
🔍 Requirements
Must Have
- 3+ years of hands-on experience on GCP, including GKE, GCE, VPC networking, IAM, load balancers, security, and networking fundamentals.
- Advanced expertise in Terraform for production-grade infrastructure deployments.
- Strong Ansible experience for configuration management.
- Proven experience in on-call rotations, incident response, and handling critical production issues.
- Proficiency in Python, Go, or Bash for automation.
- Strong understanding of SRE principles: SLIs/SLOs, error budgets, incident management, RCA.
- Experience with Kubernetes, containerization, and troubleshooting distributed systems.
Nice to Have
- Exposure to service mesh (Istio/Linkerd).
- Experience with CI/CD pipelines (Jenkins, GitLab CI, Cloud Build).
- Networking and security certifications (GCP Associate Cloud Engineer / Professional Cloud DevOps Engineer).
🌟 What We Offer
- Opportunity to work on high-scale, mission-critical systems.
- A culture of ownership, innovation, and automation.
- Competitive compensation + on-call benefits.
- Growth opportunities in SRE, Cloud, and Platform Engineering tracks.
📨 How to Apply
Share your updated resume at: deepika.balijepally@eminds.ai