About The Role:
We are seeking a Senior Observability Engineer to lead the design, implementation, and evolution of our observability stack across hybrid environments including IBM Data Centers, AWS, and IBM Cloud.
This role is critical to ensuring visibility, reliability, and performance of our services deployed via Kubernetes, Kafka, VM-based architectures, and managed data storage services.
You will work closely with Software Engineers, Site Reliability Engineers (SREs), and Service Reliability Owners (SROs) to build and maintain robust monitoring, logging, and alerting systems that support proactive incident detection and resolution.
Key Responsibilities:
- Design and implement observability solutions across cloud and on-prem environments.
- Develop and maintain monitoring dashboards, alerts, and telemetry pipelines.
- Integrate observability tools with Kubernetes, Kafka, and VM-based deployments.
- Collaborate with engineering teams to instrument applications for metrics, logs, and traces.
- Drive adoption of best practices in monitoring, logging, and distributed tracing.
- Lead incident analysis and postmortem reviews to improve system reliability.
- Evaluate and implement tools for log aggregation, metrics collection, and visualization.
- Ensure observability systems are scalable, secure, and cost-effective.
Required Qualifications:
- 7+ years of experience in observability, monitoring, or SRE roles.
- Strong expertise in observability tools such as Prometheus, Grafana, ELK/EFK stack, OpenTelemetry, or Datadog.
- Experience with Kubernetes, Kafka, and VM-based deployments.
- Familiarity with cloud platforms: AWS, IBM Cloud, and hybrid environments.
- Proficiency in scripting languages (e.g., Python, Bash) for automation and tooling.
- Solid understanding of distributed systems and microservices architecture.
- Experience working with CI/CD pipelines and infrastructure-as-code tools.
- Excellent communication and collaboration skills.
Preferred Qualifications:
- Experience with incident response and on-call rotations.
- Familiarity with service level objectives (SLOs) and error budgets.
- Knowledge of security and compliance considerations in observability.
- Experience with APM tools like New Relic, AppDynamics, or Dynatrace.
(ref: hirist.tech)