Dear Candidate,
Please find the job description (JD) below.
Location - Chennai, Bangalore, Hyderabad, Pune and Noida
Experience - 7+ years
Primary Monitoring & Incident Response
- Provide 24×7 monitoring of Azure infrastructure (compute, network, storage) using tools such as Azure Monitor, Splunk, DynaTrace, and custom dashboards.
- Respond to alerts and triage P1 / P2 escalations via ServiceNow war rooms, performing initial diagnosis and remediation where possible.
- Adhere to the Incident / Change / Exception processes.
Capacity & Availability Management
- Identify scaling opportunities for virtual machines or services as required, and identify zone-redundancy patterns for performance.
- Keep track of capacity forecasts and proactively identify performance bottlenecks.
Backup & Restore Operations
- Execute frequent backups (Azure Backup, NetApp Snapshots) and perform basic restore tasks to ensure business continuity.
- Conduct routine backup verifications / tests to confirm data integrity.
Access & Permissions Management
- Maintain Azure / NetApp file shares, setting up and adjusting access controls and AD group permissions according to organizational policy.
- Perform periodic identity and access reviews to enforce the principle of least privilege.
Logging & Metrics Oversight
- Oversee monitoring agents (e.g., Splunk, DynaTrace, Azure Alerts, SystemPulse), ensuring they are up to date and generating the right alerts / metrics for L2 to act upon.
- Collaborate with L3 to fine-tune alert thresholds and logging when chronic issues emerge.
Basic Performance Testing
- Execute routine performance checks (e.g., load or stress tests) in coordination with L3 teams when potential service degradation is suspected.
- Document and escalate consistent performance anomalies.
SKILL SET & STAFFING CONSIDERATIONS
- Comfortable reading and troubleshooting logs / metrics (Splunk, DynaTrace, Azure Monitor).
- Familiar with Azure Backup services, basic restore procedures, and file share permissions.
- Proficient in ticketing systems (ServiceNow) and in collaborating with other technical teams on escalations.
- Sufficient knowledge to follow runbooks and standard operating procedures (SOPs).
- Documentation of standard operating procedures and IaC changes should be continuously updated in a central repository (e.g., Git repos).
- Familiarity with Epic implementations (on-prem / cloud).