Job Title : LLM System Monitor - Site Reliability Engineer (SRE).
Location : Bangalore, India (Hybrid - Onsite 3 Days / Week).
Type : Full-Time (Insight Global at Cisco).
Required Skills & Experience :
- 3+ years of experience monitoring and responding to incidents in a globally deployed web application.
- Strong experience with microservices architecture on Kubernetes.
- Deep understanding of observability tools and operational metrics (Grafana, Prometheus, P99, etc.).
- Familiarity with AWS services or any major cloud provider.
- Excellent communication and customer service skills - must be able to clearly articulate status and updates to technical and non-technical stakeholders.
- Ability to ramp up quickly, take ownership, and work independently in a fast-paced environment.
Key Responsibilities :
- Monitor Grafana dashboards and observability tools to detect failures and performance issues.
- Act as the primary SRE for incident response, initiating reports from automated alerts or joining active incident channels.
- Serve as the main point of contact during incidents, delivering frequent updates to customers and incident commanders.
- Interpret operational metrics such as Quantiles, P99, and Prometheus data to assess system health.
- Track and manage permutations of a globally deployed microservices architecture running on Kubernetes.
- Collaborate with engineering and support teams to resolve issues quickly and efficiently.
- Maintain strong communication and customer service throughout incident lifecycles.
- Utilize foundational knowledge of AWS or other cloud platforms to support infrastructure monitoring.
- Ramp up quickly on existing systems and processes.
Why Join?
- Work with cutting-edge LLM infrastructure at Cisco.
- Full-time opportunity with Insight Global.
- Hybrid flexibility - onsite in Bangalore 3 days / week.
- Immediate interviews and onboarding.
- Competitive compensation.
(ref : hirist.tech)