Description
Job Overview:
We are seeking a highly skilled Software Engineer with a focus on incident triage, complex troubleshooting, and defect resolution for cloud-based software systems. This role will be instrumental in diagnosing and resolving production issues, ensuring system stability, and collaborating with engineering teams to prevent future incidents. Experience with Node.js and associated technologies is essential.

Key Responsibilities:
- Monitor, identify, and triage production incidents to assess impact, root cause, and potential resolution paths.
- Conduct detailed troubleshooting of cloud-based software systems to diagnose complex defects and implement corrective actions.
- Manage incident escalation processes, ensuring timely communication and coordination with relevant teams.
- Collaborate with developers to resolve bugs, optimize system performance, and deploy hotfixes as needed.
- Analyze logs, error reports, and monitoring data to identify patterns and proactively mitigate potential issues.
- Implement automated monitoring and alerting solutions to detect anomalies and streamline incident response.
- Document incident response processes, including root cause analysis and preventive measures.
- Participate in on-call rotation to provide 24/7 support for critical incidents.
- Develop and maintain knowledge base articles, playbooks, and incident runbooks for common issues.
- Contribute to post-incident reviews, identifying areas for improvement in monitoring, response, and resolution processes.

Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent work experience).
- 3+ years of experience in software engineering, with a focus on incident management and resolution in cloud environments.
- Strong proficiency in Node.js, including debugging, error handling, and performance optimization.
- Experience with cloud platforms (AWS, Azure, or GCP), including monitoring and troubleshooting cloud-native applications.
- Proficiency in logging frameworks (e.g., Winston, Bunyan) and monitoring tools (e.g., Datadog, ELK Stack, CloudWatch).
- Strong problem-solving skills and ability to perform in high-pressure, time-sensitive scenarios.
- Experience with CI/CD pipelines and automated deployments (e.g., Jenkins, GitLab CI, AWS CodePipeline).
- Excellent communication and documentation skills, with a focus on clear incident reporting and knowledge transfer.
- Ability to work effectively in a cross-functional team, collaborating with developers, DevOps, and product owners.
- Written and spoken proficiency in English.

Preferred Skills:
- Experience with containerization (Docker, Kubernetes).
- Knowledge of REST APIs, WebSockets, and microservices architecture.
- Familiarity with incident management frameworks (e.g., ITIL, SRE practices).
- Understanding of security best practices in cloud-based systems.

Skills Required
Node.js, ELK Stack, Datadog, Jenkins, CloudWatch, GCP, Docker, AWS CodePipeline, Azure, Kubernetes, AWS
Senior Analyst • Pune, India