Job Overview :
We are seeking an Observability & Monitoring Engineer with hands-on experience in Datadog to strengthen our monitoring and incident management capabilities. The ideal candidate will be responsible for proactive monitoring, synthetic monitoring setup, RUM analysis, and end-to-end application health checks. The role involves collaborating closely with development and infrastructure teams to ensure application stability, availability, and performance.
Key Responsibilities :
- Implement and manage observability and monitoring solutions using Datadog.
- Perform proactive monitoring of applications, databases, and infrastructure to prevent incidents.
- Conduct RUM (Real User Monitoring) analysis and coordinate with application development teams to resolve performance / user experience issues.
- Carry out daily health checks for applications and databases to ensure smooth operations.
- Create and manage synthetic monitoring in Datadog for critical application workflows.
- Enable and configure log management & processing for hosts as required.
- Provide detailed Root Cause Analysis (RCA) for incidents and service disruptions.
- Initiate and lead bridge calls for critical incidents, ensuring timely updates and RCA sharing.
- Optimize and fine-tune monitoring alerts to minimize noise and improve accuracy.
- Design and configure new alerts, dashboards, and monitors in Datadog.
Required Skills & Qualifications :
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- 1+ years of experience in Application Monitoring / Observability.
- Hands-on expertise in Datadog (RUM, APM, Logs, Infrastructure, Synthetic Monitoring).
- Strong knowledge of incident management processes and RCA preparation.
- Experience working with Application Development teams for issue resolution.
- Good understanding of database monitoring and system performance metrics.
- Ability to lead critical incident bridge calls and communicate effectively with stakeholders.
- Strong troubleshooting, analytical, and problem-solving skills.
Good to Have (Preferred) :
- Experience with other monitoring tools (e.g., AppDynamics, Dynatrace, Splunk, Prometheus, Grafana).
- Familiarity with ServiceNow / Jira / other ticketing tools.
- Knowledge of ITIL processes (Incident, Problem, and Change Management).
Soft Skills :
- Strong communication and coordination skills.
- Ability to work under pressure and in critical situations.
- Proactive mindset with attention to detail.
- Team player with strong ownership qualities.