Position Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) / DevOps Engineer to join our infrastructure team. You will be responsible for designing, building, and maintaining resilient, scalable, and secure infrastructure in cloud-native environments. This role will involve close collaboration with development, QA, and security teams to automate operations, streamline deployments, and drive best practices in observability, security, and performance.
Key Responsibilities
- Design, implement, and manage cloud infrastructure (GCP/AWS/Azure) using Infrastructure as Code (Terraform)
- Build, maintain, and optimize CI/CD pipelines with tools such as GitLab CI, CircleCI, and ArgoCD
- Ensure high availability and performance of applications running on Kubernetes (GKE/EKS/AKS) and other container orchestration tools
- Implement observability solutions using Prometheus, Grafana, ELK, and other monitoring/logging tools
- Work with development teams to enhance application performance and deployment workflows
- Automate and manage IAM, RBAC, network policies, and vulnerability scanning
- Participate in incident management, root cause analysis, and postmortem processes
- Continuously improve infrastructure reliability and reduce manual operational efforts (toil)
Basic Qualifications
- Strong knowledge of Linux system administration
- Proficiency in scripting languages such as Python, Bash, or Go
- Solid hands-on experience with cloud platforms (GCP preferred; AWS or Azure acceptable)
- Proficient in Kubernetes operations, including Helm charts, service meshes, and operators
- Experience with Terraform and Infrastructure as Code best practices
- Experience building and maintaining CI/CD pipelines (e.g., GitLab CI, CircleCI, ArgoCD)
- Familiarity with observability tools (Prometheus, Grafana, ELK, etc.)
- Good understanding of networking concepts: TCP/IP, DNS, load balancing, firewalls
Preferred Qualifications
- Experience with advanced networking and service meshes (e.g., Istio)
- Familiarity with SRE principles: SLOs, SLIs, error budgets
- Exposure to multi-cluster or hybrid-cloud infrastructure setups
- Experience with incident response and post-incident review processes
Key Skills (Comma-Separated)
Site Reliability Engineering, DevOps, GCP, AWS, Azure, Terraform, CI/CD, GitLab CI, CircleCI, ArgoCD, Kubernetes, GKE, EKS, AKS, Helm, Prometheus, Grafana, ELK, Python, Bash, Go, IAM, RBAC, Network Policies, Service Mesh, Istio, TCP/IP, DNS, Load Balancers, Firewalls, Monitoring, Logging, Error Budgets, SLOs, SLIs, Incident Management
Skills Required
Cloud Infrastructure, Kubernetes, Prometheus, Grafana, ELK