Key Responsibilities:
Production System Management
- Manage and support production-grade infrastructure across cloud and data centers.
- Take ownership of monitoring and troubleshooting production systems, including on-call or shift-based support.
- Dive deep into Linux system internals, networking, and production debugging.
Monitoring & Observability
- Build and improve observability stacks using Prometheus, Grafana, ELK/EFK, or equivalent tools.
- Partner with developers to ensure new features and services are production-ready with monitoring, logging, and failover strategies.
Automation & CI/CD
- Develop and maintain automation scripts and tools using Python, Bash, or similar languages.
- Work with CI/CD tools (Jenkins, GitHub Actions, GitLab CI) to support reliable deployments.
- Continuously improve system availability, reliability, and performance through automation and process improvements.
Incident Management & Reliability
- Drive incident management and root cause analysis (RCA), and implement long-term fixes.
- Automate operational tasks to reduce mean time to recovery (MTTR).
- Engineer systems to prevent recurring problems and ensure reliability at scale.
Skills Required
Python, Bash, AWS, Cloud, Linux