Description :
Role Overview :
We are seeking an experienced Site Reliability Engineer (SRE) with strong hands-on expertise in Kubernetes, Python, and Linux.
The ideal candidate will be responsible for ensuring reliability, scalability, security, and performance of distributed systems and production workloads.
Mandatory requirement : Candidates must have a minimum of 4 years of experience in Python, Kubernetes, and SRE.
Profiles that do not meet this requirement must be rejected.
Key Responsibilities :
- Design, build, and maintain scalable and reliable production systems using SRE principles.
- Deploy, manage, and optimize workloads on Kubernetes clusters (networking, storage, deployments, scaling, troubleshooting).
- Develop Python automation scripts and tools to improve system efficiency, observability, and reliability.
- Implement CI/CD practices, system monitoring, disaster recovery, and incident management processes.
- Perform root cause analysis (RCA) and ensure that post-incident reviews and preventive actions are carried out.
- Work with cross-functional teams to drive automation and reduce manual intervention.
- Improve system reliability through performance tuning, capacity planning, and automated alerts.
- Build and maintain Linux-based production environments.
Essential Skills :
- Kubernetes : Networking, storage, deployments, cluster operations, troubleshooting
- Python : Strong scripting and automation experience (minimum 4 years)
- Linux : Administration, configuration, system performance & debugging
- SRE experience : On-call handling, RCA, reliability engineering, performance, scalability
Good to Have Skills :
- Logging & monitoring tools such as Grafana, Loki, Dynatrace
- Experience with containerization tools (Docker)
- Exposure to cloud platforms (AWS / GCP / Azure)
Soft Skills :
- Excellent analytical and debugging skills
- Strong communication & documentation ability
- Ownership mindset with a focus on continuous improvement
(ref : hirist.tech)