Roles and Responsibilities:
- Act as the primary point of contact for major incidents and escalations, ensuring rapid response and communication across technical and business teams.
- Lead and coordinate incident resolution efforts involving multiple support teams and stakeholders to restore service as quickly as possible.
- Manage the end-to-end incident lifecycle: detection, logging, categorization, prioritization, resolution, and closure.
- Conduct detailed Root Cause Analysis (RCA) for high-severity incidents and drive implementation of permanent fixes.
- Work closely with AWS cloud infrastructure teams to identify and resolve platform-level or configuration issues.
- Collaborate with architecture and development teams to identify patterns, improve system reliability, and strengthen incident prevention strategies.
- Develop and maintain incident management processes, playbooks, and metrics to improve response efficiency and reduce recurrence.
- Manage communications and stakeholder expectations during critical incidents and post-incident reviews.
- Participate in on-call rotations and ensure 24x7 support coverage as required.
- Continuously drive improvements in monitoring, alerting, and automation to minimize incident impact and MTTR (Mean Time to Recovery).
Required Skills & Qualifications:
- 8–14 years of experience in Incident Management / Production Support / Site Reliability / IT Operations roles.
- Strong experience in managing incidents within complex distributed architectures and cloud-based environments (AWS preferred).
- Expertise in AWS services such as EC2, S3, Lambda, CloudWatch, RDS, and related monitoring and logging tools.
- Exposure to Redis and Elasticsearch for cache management, data indexing, and performance optimization.
- Excellent communication and coordination skills to handle high-pressure situations and interact with senior stakeholders.
- Proven ability to perform Root Cause Analysis (RCA) and implement corrective and preventive measures.
- Experience with ITIL processes (Incident, Problem, Change Management).
- Familiarity with tools such as ServiceNow, Jira, CloudWatch, PagerDuty, etc.
- Strong analytical and problem-solving skills with a proactive approach to issue resolution.
- Ability to work in 24x7 production support environments and handle critical incident escalations effectively.