Description :
Title : Senior Observability Engineer
Experience : 7+ years
Location : Remote
Type : Full-time
Job Description :
About the role :
We want one observability standard across the stack. Today : some services emit metrics, some don't; the frontend isn't fully measured; infra / GitOps / app alerts all land in the same place. You will design and roll out the observability blueprint : browser → frontend (React / Next.js / RUM) → API / GraphQL → Kubernetes → alert routing, so we can see real user performance, track P95 / P99, keep SLOs for critical envs, and send the right alerts to the right teams.
Must-have :
- 8+ years in Observability / SRE / Platform / Backend with production systems.
- Strong Kubernetes experience (agents / DaemonSets, metadata enrichment).
- Hands-on OpenTelemetry experience for Go, Node.js, and the browser : knows what to add to code, not just how to turn on APM.
- RUM experience in Datadog (preferred) / Grafana / Dynatrace, and can set up FE → BE correlation.
- Solid Prometheus / Grafana or Datadog skills (histograms, recording rules, alerting rules).
- Proven track record of reducing alert noise and standardizing SLOs across teams.
What you will do :
- Enable RUM & frontend metrics for React / Next.js (Web Vitals, SPA nav, network timing, JS errors) and link them to backend traces.
- Instrument API / GraphQL calls so user actions can be traced down to slow endpoints / services.
- Define SLO / SLI templates (latency, availability, error rate) per environment / client.
- Design alert strategy : severity levels (P1 / P2 / P3), who gets what (infra vs app vs GitOps), Slack / Teams routing, and escalation.
- Standardize observability across the stack :
  - Standard labels / tags (service, env, version, tenant)
  - Standard dashboards per service (traffic, errors, latency, saturation)
  - Standard alert rules (burn-rate, error spike, high latency)
  - Standard OTel / Alloy collector config checked into Git
- Run OTel / Grafana Alloy pipelines for metrics, logs, traces with sampling and cardinality controls.
- Ship golden dashboards for : Frontend UX, API performance, Service / K8s health.
- Keep it GitOps : dashboards, alert rules, and collector configs as code (ref : hirist.tech)