Key Responsibilities:
- Monitoring Tool Support:
- Provide L2 support for various monitoring tools (e.g., Nagios, Zabbix, Splunk, Prometheus, SolarWinds, AppDynamics, New Relic).
- Troubleshoot and resolve escalated alerts, incidents, and issues related to system performance, application health, network connectivity, and infrastructure availability.
- Collaborate with L1 support teams to assist in the diagnosis and resolution of simpler issues.
- Incident & Problem Management:
- Handle escalated incidents from L1 support, providing root cause analysis (RCA) and resolution.
- Track and maintain records of incidents, problems, and resolutions within the ticketing system (e.g., ServiceNow, JIRA).
- Ensure SLA compliance for issue resolution and follow up on tickets to meet agreed-upon timelines.
- Alert Management:
- Review and manage monitoring alerts for critical systems, servers, databases, and applications.
- Ensure alerts are appropriately categorized and routed for resolution.
- Investigate and suppress or tune false positives and irrelevant alerts to maintain the integrity of the monitoring system.
- Performance Monitoring & Reporting:
- Continuously monitor system health, application performance, and network traffic to proactively identify issues before they affect services.
- Maintain and improve monitoring dashboards to reflect the current health of the environment.
- Generate regular reports on system performance and uptime, providing recommendations for improvements or preventive actions.
- Tool Configuration & Optimization:
- Assist in the configuration and tuning of monitoring tools to ensure they provide meaningful and actionable data.
- Customize monitoring thresholds, alerts, and notifications to align with the organization's operational needs.
- Continuously improve the monitoring setup to ensure that it effectively supports the evolving infrastructure and application stack.
- Documentation & Knowledge Sharing:
- Document troubleshooting procedures, known issues, and best practices for the monitoring tools.
- Share knowledge and insights with L1 support teams to improve their troubleshooting capabilities.
- Maintain user manuals or standard operating procedures (SOPs) for monitoring tool management and escalation processes.
- Collaboration & Communication:
- Collaborate with DevOps, System Admins, and Network Engineers to resolve infrastructure or application performance issues.
- Communicate effectively with internal teams regarding ongoing incidents, resolution timelines, and potential impacts on services.
- Proactive System Improvements:
- Work with the IT Operations team to identify and implement proactive measures that improve overall system performance and reduce downtime.
- Provide input for optimizing monitoring thresholds, reducing false alarms, and implementing new monitoring solutions or features.
Required Qualifications:
- 2-5 years of experience in L2 support or operations with monitoring tools.
- Strong understanding of IT infrastructure, including servers, databases, networks, and applications.
- Hands-on experience with monitoring tools (e.g., Nagios, Zabbix, Prometheus, Splunk, AppDynamics, New Relic).
- Experience working with alert management systems and troubleshooting complex issues.
- Familiarity with cloud environments (AWS, Azure, GCP) and the related monitoring tools.
- Solid understanding of system performance metrics and the ability to identify and troubleshoot issues based on performance data.
- Experience using ticketing systems (e.g., ServiceNow, JIRA, Zendesk) for incident management and tracking.
- Proficiency in Linux/Unix and Windows Server operating systems.
- Scripting knowledge (e.g., Bash, Python, PowerShell) for automating monitoring tasks and alerts.
- Good understanding of networking concepts (DNS, HTTP, TCP/IP, etc.) and their impact on monitoring.
Skills Required:
Jira, Zendesk, AWS, Azure