Job Summary :
We are seeking a highly skilled and proactive Team Lead, Observability & RCA, to build and lead a team focused on monitoring, alerting, logging, and tracing (MELT), as well as deep Root Cause Analysis (RCA). The ideal consultant will have hands-on expertise in tools like Datadog, a strong understanding of distributed systems, and the ability to collaborate with development, support, and DevOps teams to turn observed issues into actionable resolutions.
Key Responsibilities :
- Build and manage observability pipelines to ingest and correlate Metrics, Events, Logs, and Traces (MELT) from multiple systems across web, mobile, backend, and infra.
- Lead Datadog implementation and optimization, including APM, RUM, dashboards, synthetics, alerting, and anomaly detection features.
- Act as the primary point of contact for triaging and analyzing issues reported via observability tools.
- Perform end-to-end Root Cause Analysis (RCA) on recurring incidents and anomalies, and drive closure by working closely with development, QA, infra, and support teams.
- Define and enforce observability best practices, including tagging strategy, SLO / SLA setup, error budget tracking, and log hygiene.
- Build and maintain dashboards, monitors, and custom views for different stakeholders including Engineering, Support, and Leadership.
- Drive incident review meetings, document learnings, and contribute to postmortems and corrective action plans.
- Continuously evaluate the health of applications and services, and suggest architectural or code-level improvements.
- Collaborate with InfoSec and Compliance teams to ensure telemetry data is protected, compliant, and governed.
- Mentor junior engineers on observability frameworks, diagnostic techniques, and tooling.
Required Skills & Experience :
- 6 to 10 years of total experience, with 2 to 4 years in observability, SRE, or monitoring-focused roles.
- Strong hands-on experience with Datadog (APM, RUM, Infrastructure Monitoring, Synthetics, Dashboards, Alerts, etc.).
- Solid understanding of MELT concepts and how to structure telemetry for modern applications.
- Proficient in Root Cause Analysis (RCA), incident lifecycle management, and blameless postmortems.
- Experience working with microservices, cloud platforms (AWS / GCP / Azure), and containerized environments (Docker / Kubernetes).
- Strong skills in scripting (Python / Bash) and tools like Fluentd, Logstash, or OpenTelemetry.
- Familiarity with CI / CD pipelines, version control (Git), and infrastructure-as-code is a plus.
- Ability to communicate clearly with cross-functional stakeholders, translate technical findings, and influence resolution paths.
- Strong sense of ownership, analytical thinking, and process improvement mindset.
Good to Have :
- Experience integrating observability tools with ServiceNow, PagerDuty, or Slack for automated alerting and incident response.
- Knowledge of ITIL practices, service health modeling, and business KPI mapping.
- Certification in Datadog, AWS Cloud Practitioner, or SRE Foundations.
Soft Skills :
- Strong leadership and team mentoring abilities.
- Excellent analytical and problem-solving skills.
- Effective verbal and written communication.
- Ability to work independently and lead in a fast-paced, production-critical environment.
(ref : hirist.tech)