Key Responsibilities
Manage and scale production systems hosted on Google Cloud Platform (GCP)
Implement SRE best practices: monitoring, alerting, SLAs, SLOs, and error budgets
Automate operational tasks using Infrastructure as Code (IaC) tools like Terraform
Improve system reliability and reduce manual interventions through automation
Collaborate with development teams to ensure new services are production-ready
Perform incident response and post-mortem analysis to prevent recurring issues
Design and implement CI/CD pipelines for rapid and safe deployments
Manage GCP resources: IAM, VPC, Compute Engine, GKE, Cloud Functions, Pub/Sub, BigQuery, etc.
Ensure security, compliance, and cost optimization across the cloud infrastructure
Required Skills & Qualifications
5+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
Strong hands-on experience with Google Cloud Platform (GCP) services
Proficiency with Terraform or other IaC tools
Solid knowledge of Kubernetes (GKE), containerization, and microservices
Strong scripting skills in Python, Go, or Shell
Familiarity with incident response and post-mortem culture
Knowledge of networking, security, and cloud cost management
Preferred Qualifications
GCP certifications (e.g., Professional Cloud DevOps Engineer)
Prior experience working with e-commerce or high-scale platforms
Familiarity with SRE practices and tooling such as chaos engineering and service mesh (Istio)
Soft Skills
Strong communication and stakeholder management
Problem-solving mindset with a focus on reliability and automation
Ability to work independently in a distributed, outsourced team model
Site Reliability Engineer • India