Description :
The core responsibilities for the job include the following :
Platform Monitoring and Incident Handling :
- Monitor platform alerts, logs, and dashboards to proactively detect issues.
- Perform initial triage and root-cause analysis, and escalate incidents when necessary.
Operations and Maintenance :
- Execute standard operating procedures (SOPs), perform health checks, and complete routine maintenance tasks.
- Coordinate with engineering and SRE teams to resolve critical issues and maintain SLAs.
Documentation and Reporting :
- Maintain accurate documentation of issues, actions taken, and resolutions.
- Contribute to the internal knowledge base to improve future response times.
Communication and Collaboration :
- Provide timely updates to stakeholders on incident status.
- Work closely with engineering, product, and operations teams for continuous improvement.
Requirements :
Experience :
- 2+ years in technical support, IT operations, or application monitoring roles.
Technical Knowledge :
- Familiarity with cloud platforms (AWS, GCP, or Azure).
- Understanding of Kubernetes basics and containerized environments is a plus.
- Good grasp of logs, monitoring tools (e.g., Grafana, Prometheus, Datadog, Splunk), and incident management workflows.
Soft Skills :
- Strong analytical, troubleshooting, and problem-solving skills.
- Excellent communication skills to collaborate across distributed teams.
Work Flexibility :
- Comfortable working on night shifts (India time) and handling on-call duties as needed.
(ref : hirist.tech)