Key Responsibilities
Manage and scale production systems hosted on Google Cloud Platform (GCP)
Implement SRE best practices: monitoring, alerting, SLAs, SLOs, and error budgets
Automate operational tasks using Infrastructure as Code (IaC) tools like Terraform
Improve system reliability and reduce manual interventions through automation
Collaborate with development teams to ensure new services are production-ready
Handle incident response and conduct post-mortem analysis to prevent recurring issues
Design and implement CI/CD pipelines for rapid and safe deployments
Manage GCP resources: IAM, VPC, Compute Engine, GKE, Cloud Functions, Pub/Sub, BigQuery, etc.
Ensure security, compliance, and cost optimization across the cloud infrastructure
Required Skills & Qualifications
5+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
Strong hands-on experience with Google Cloud Platform (GCP) services
Proficiency with Terraform or other IaC tools
Solid knowledge of Kubernetes (GKE), containerization, and microservices
Strong scripting skills in Python, Go, or Shell
Familiarity with incident response and post-mortem culture
Knowledge of networking, security, and cloud cost management
Preferred Qualifications
GCP certifications (e.g., Professional Cloud DevOps Engineer)
Prior experience working with e-commerce or high-scale platforms
Familiarity with SRE practices and tooling such as chaos engineering and service mesh (Istio)
Soft Skills
Strong communication and stakeholder management
Problem-solving mindset with a focus on reliability and automation
Ability to work independently in a distributed, outsourced team model
Site Reliability Engineer • Hyderabad, Telangana, India