We are looking for a dedicated Site Reliability Engineer (SRE) - Cloud Ops to join our team. In this role, you will play a key part in ensuring the stability and scalability of our cloud infrastructure. You will be responsible for monitoring, troubleshooting, and resolving infrastructure and application alerts, managing pipelines, and addressing environment-related issues in a dynamic 24/7 operational environment.
Responsibilities :
- Infrastructure Monitoring and Alert Response : Proactively monitor infrastructure and application alerts, ensuring prompt resolution to maintain uptime and performance.
- Shift-Based Operations : Work in a 24/7 environment with flexible availability for rotational shifts.
- Cloud Environment Management : Manage and resolve environment-related issues, focusing on stability and efficiency.
- Pipeline Management : Oversee CI/CD pipelines and ensure smooth deployment of updates and releases.
- Operational Tasks : Execute day-to-day operational activities, including incident management, change management, and maintaining operational excellence.
- Tool Management : Use tools such as Kubernetes, PagerDuty, and Google Cloud Platform (GCP) to support operational activities.
Requirements :
- B.E./B.Tech graduate with 2+ years of experience in Site Reliability, Cloud Ops, Monitoring, and Alerting.
- Expertise : In-depth knowledge of monitoring tools (Prometheus, Grafana, ELK) and alert systems, and the ability to resolve related issues promptly.
- Kubernetes : Hands-on experience with Kubernetes for orchestration and container management.
- PagerDuty : Proficiency in setting up and managing alerting systems.
- Cloud Fundamentals : Basic understanding of GCP (Google Cloud Platform) services and operations.
- Incident Management : Strong problem-solving skills and experience in handling critical incidents under pressure.
- DevOps Processes : Basic knowledge of CI/CD pipelines, automation, and infrastructure-as-code practices.
(ref : hirist.tech)