Job Description :
Zycus is looking for a Site Reliability Engineer (SRE) with deep expertise in Kubernetes, automation, and Linux systems.
The ideal candidate will have hands-on experience in deploying, administering, and optimizing large-scale production systems, with a strong focus on microservices architecture and on automation, performance, and reliability across our SaaS platform.
Roles And Responsibilities :
- System Reliability & Uptime : Ensure high availability, performance, and reliability of applications and infrastructure.
- Kubernetes & Cluster Management : Deploy, administer, and maintain Kubernetes clusters, managing scaling, upgrades, and troubleshooting.
- Microservices Management : Handle the deployment, monitoring, and scaling of microservices in distributed environments.
- Incident Management : Respond to production incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence.
- Automation & Infrastructure as Code (IaC) : Automate repetitive tasks, infrastructure provisioning, and deployment workflows using tools like Ansible and Terraform.
- Monitoring & Observability : Implement and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system health and application performance.
- Performance Optimization : Analyze system performance, identify bottlenecks, and optimize resources for better efficiency.
- Disaster Recovery & Backup : Design and implement backup and disaster recovery (DR) strategies for business continuity.
- Capacity Planning : Forecast infrastructure needs based on performance trends and business growth to ensure scalability.
- Security & Compliance : Ensure infrastructure and applications meet security standards and compliance requirements.
- Collaboration with Dev & Ops Teams : Work closely with development and operations teams to improve deployment pipelines, release processes, and system reliability.
- Documentation : Maintain clear and detailed documentation of systems, processes, and incident reports for knowledge sharing and compliance.
- Continuous Improvement : Identify opportunities for improving system architecture, deployment strategies, and automation workflows.
- Cloud Infrastructure Management : Manage cloud services (AWS, GCP, Azure) for resource optimization, cost management, and automation.
- On-Call Support : Participate in on-call rotations to handle urgent production issues and ensure rapid recovery.
Job Requirement :
Experience : 5 to 12 years.
Technical skills as mentioned below :
Must Have :
Kubernetes Expertise :
- Hands-on experience with installing and provisioning Kubernetes clusters.
- Deep understanding of core Kubernetes components such as CRI, CNS, ETCD, CoreDNS, KubeProxy.
- Strong knowledge of Kubernetes internal networking, service discovery, and ingress management.
Kubernetes Distributions :
- Hands-on experience with different Kubernetes provisioners and distributions.
Kubernetes Cluster Administration :
- Experience in administering production Kubernetes clusters, including backup and disaster recovery (DR) strategies.
- Familiarity with cluster health monitoring and troubleshooting issues.
Monitoring Tools :
- Exposure to monitoring tools such as Prometheus, Grafana, Datadog or AppDynamics.
Automation & Scripting :
- Strong programming skills in Python, Shell, or similar languages.
- Hands-on experience with Infrastructure-as-Code (IaC) tools such as Terraform or Ansible.
- Cloud automation experience, ideally with AWS or other major cloud platforms.
Operating Systems :
- Hands-on experience with Linux systems.
- Experience with microservices architecture and managing more than 50 microservices simultaneously.
Good To Have Skills :
- Experience with OpenShift virtualization in production environments.
- Knowledge of AWS EKS, Rancher, or other Kubernetes distributions.
- CKA (Certified Kubernetes Administrator) certification or equivalent.
- Experience in fine-tuning RHEL, CentOS, and Ubuntu.
- Familiarity with DevSecOps practices, container security, and compliance frameworks.
(ref : hirist.tech)