Site Reliability Engineer - Elastic Kubernetes Service

MNR Solutions, Pune

22 hours ago

Job description

Site Reliability Engineer (SRE), Kubernetes & Cloud

Position Summary:

We are seeking a highly skilled Site Reliability Engineer (SRE) with deep expertise in Kubernetes and cloud technologies (AWS, Azure, or GCP). The SRE will be responsible for designing, deploying, automating, and supporting highly available, scalable, and secure containerized applications in cloud-native environments. You will work closely with development, operations, and security teams to ensure the reliability, performance, and efficiency of our production systems.

Key Responsibilities:

Design, deploy, and manage Kubernetes clusters (on-premises and/or cloud-managed, such as EKS, AKS, or GKE) to support scalable microservices architectures.
Automate infrastructure provisioning and application deployment using Infrastructure as Code (IaC) tools such as Terraform, Helm, or CloudFormation.
Monitor, troubleshoot, and optimize system performance using observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
Implement and manage CI/CD pipelines to ensure rapid, repeatable, and reliable software delivery.
Ensure system reliability, availability, and security through proactive monitoring, incident response, and root cause analysis.
Develop and maintain runbooks, dashboards, and documentation for operational procedures and system architectures.
Participate in on-call rotations and respond to production incidents, ensuring minimal downtime and fast recovery.
Collaborate with development and operations teams to drive DevOps and SRE best practices, including capacity planning, scaling, and cost optimization.
Continuously improve automation, tooling, and processes to reduce manual work and increase system reliability.

Required Skills & Experience:

3+ years of experience as an SRE, DevOps Engineer, or in a similar role supporting large-scale, production-grade environments.

Expertise in Kubernetes (deployment, scaling, upgrades, troubleshooting, networking, RBAC, etc.).

Hands-on experience with at least one major cloud provider: AWS, Azure, or GCP.

Proficiency in scripting/programming (Python, Bash, Go, etc.).

Experience with IaC tools (Terraform, Helm, CloudFormation, ARM, etc.).

Strong knowledge of Linux systems administration and networking concepts.

Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, ELK/EFK, Datadog, etc.).

Experience with CI/CD tools (Jenkins, GitLab CI, ArgoCD, etc.).

Understanding of security best practices in cloud and containerized environments.

Excellent troubleshooting and problem-solving skills.

Strong communication and collaboration skills.

Preferred Qualifications:

Certified Kubernetes Administrator (CKA) or similar certification.

Experience with service mesh (Istio, Linkerd), ingress controllers, and API gateways.

(ref: hirist.tech)
