Job Description
Job Title: SRE Observability Engineer
Experience: 6 Years
Location: Hyderabad
Notice Period: Immediate Joiners Only
About the Role
We are seeking a highly skilled and motivated SRE Observability Engineer to design, build, and scale observability platforms across our distributed systems. The ideal candidate will have deep expertise in monitoring, logging, tracing, and alerting frameworks along with hands-on experience in Prometheus, Grafana, and Loki.
This role involves close collaboration with Development, DevOps, Infrastructure, and SRE teams to ensure end-to-end visibility, reliability, performance, and availability of critical systems.
Mandatory Skills
- Observability
- Grafana
- Prometheus & Loki (including strong query-writing skills)
Key Responsibilities
- Lead the design and implementation of observability solutions spanning monitoring, logging, and distributed tracing across cloud and on-prem environments.
- Develop and maintain advanced monitoring frameworks using Prometheus, Grafana, Datadog, New Relic, AppDynamics, and other observability platforms.
- Implement and optimize distributed tracing using OpenTelemetry, Jaeger, or Zipkin to enhance application visibility and performance diagnostics.
- Improve log management pipelines using tools such as Elasticsearch, Splunk, Loki, and Fluentd, ensuring efficient log ingestion, parsing, storage, and analysis.
- Build advanced alerting and anomaly detection mechanisms for proactive issue resolution and improved MTTR.
- Work with development and SRE teams to enhance observability integration within CI/CD pipelines, microservices, and cloud-native architectures.
- Automate observability processes using Python, Bash, or Golang to scale operations and reduce manual effort.
- Ensure observability platforms are resilient, scalable, and cost-effective for large-scale distributed systems.
- Lead incident response efforts, offering actionable insights through logs, metrics, and traces for rapid troubleshooting.
- Stay updated on evolving observability, SRE, and monitoring practices to continuously strengthen observability posture.
Required Qualifications
- 5+ years of hands-on experience in Observability, SRE, DevOps, or similar roles, managing large-scale distributed systems.
- Strong experience designing and implementing solutions using Prometheus, Grafana, Datadog, New Relic, and AppDynamics.
- Expertise in log management tools such as Elasticsearch, Splunk, Loki, and Fluentd, including performance optimization.
- Deep proficiency in distributed tracing frameworks (OpenTelemetry, Jaeger, Zipkin).
- Hands-on experience with cloud platforms (Azure, AWS, or GCP) and Kubernetes-based environments.
- Strong scripting skills in Python, Bash, or Golang, and experience with IaC tools such as Terraform and Ansible.
- Solid understanding of system architecture, performance tuning, scalability, and high-availability architectures.
- Proven experience in guiding teams, providing technical leadership, and enforcing observability best practices.
- Excellent problem-solving skills with the ability to provide data-driven, actionable insights.
- Strong stakeholder management, communication, and collaboration abilities.
Preferred Qualifications
- Experience with AI-driven observability and automated anomaly detection.
- Familiarity with microservices, serverless, and event-driven architectures.
- Prior experience in on-call rotations and incident management in high-availability environments.
- Certifications in cloud platforms, SRE, or observability tools.
Requirements
SRE