No longer accepting applications
Senior Site Reliability Engineer – Grafana & Observability

Aptimized, Delhi, India
5 days ago
Job Description – Senior Site Reliability Engineer (SRE) – Grafana & Observability

Position: Senior Site Reliability Engineer – Grafana & Observability

Location: [Hyderabad / Hybrid]

Experience: 10–20+ years

Operating globally, Aptimized is a premium ERP, HCM, and Technology Optimization Consulting agency. Our team at Aptimized focuses on helping our customers become intelligent enterprises through leveraging creative technology solutions. At Aptimized, we prioritize our clients’ needs and create tailor-made solutions to deliver success. We understand success is not achieved through chance. We listen to your concerns. We consult with your organization. We accelerate your business. Visit us at our website to learn more about what we can do for you!

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with deep hands-on experience in the Grafana ecosystem, observability engineering, and large-scale monitoring platforms.

The ideal candidate will be an expert in building and managing Grafana dashboards, Managed Grafana, Prometheus monitoring, OpenTelemetry pipelines, and integrating multiple data sources across cloud and on-prem infrastructures.

This role focuses heavily on building real-time observability, improving system reliability, and enabling data-driven operational insights.

Key Responsibilities

Grafana Engineering & Dashboard Development

Build advanced Grafana dashboards with alerts, custom panels, JSON models, and data visualizations.

Work with Managed Grafana (Azure Managed Grafana / AWS Managed Grafana) for enterprise-grade observability.

Integrate Grafana with multiple data sources such as:

Prometheus

ELK / Elasticsearch

Dynatrace

CloudWatch

Azure Monitor

InfluxDB / Telegraf

ServiceNow (incident integrations)

Develop role-based access (RBAC) and multi-tenant dashboard architectures.

Prometheus, Metrics & Alerting

Architect and manage Prometheus metrics pipelines, exporters, and recording / alerting rules.

Optimize PromQL queries for high-performance dashboards.

Reduce alert noise through intelligent rule tuning and SLO-driven alerts.

Observability Platform Ownership

Build and maintain an end-to-end observability stack:

Grafana + Prometheus + ELK + OpenTelemetry + Cloud-native monitoring tools.

Integrate logs, metrics, and traces into unified dashboards.

Establish SLIs, SLOs, error budgets, and real-time reliability insights.

Kubernetes & Cloud Monitoring

Deploy and monitor Kubernetes clusters (AKS, EKS, Rancher).

Configure Grafana Alloy / Prometheus Operator / kube-state-metrics for cluster-level insights.

Implement Infrastructure-as-Code for observability stack deployments.

Automation & Infrastructure as Code

Automate monitoring agent deployments using:

Terraform

Azure DevOps / GitHub / GitLab

FluxCD, Kustomize, Helm

Develop monitoring-as-code for repeatable environment provisioning.

Incident Response & Performance Troubleshooting

Provide deep troubleshooting across infrastructure, network, applications, and microservices.

Build automated dashboards for war rooms and cross-team collaboration.

Leverage Grafana annotations, synthetic monitoring, and event correlation.

Security, Compliance & Governance

Implement secure access to metric and log dashboards using IAM, RBAC, and ABAC.

Configure audit logs, long-term retention, and secure storage pipelines.

(Optional: FedRAMP / NIST experience is beneficial for regulated workloads.)

Required Skills & Expertise

Grafana & Observability (Primary)

Expert in Grafana dashboard engineering

Prometheus + Alertmanager

Managed Grafana (Azure / AWS)

ELK Stack (Elasticsearch, Logstash, Kibana)

OpenTelemetry (OTEL) metrics & traces

Grafana Alloy, Loki (Bonus)

Cloud Platforms

Azure, AWS, IBM Cloud (Nice-to-have)

CloudWatch, Azure Monitor, App Insights

Containers & Infrastructure

Kubernetes (AKS, EKS)

Docker, Rancher, OpenShift

Linux (RHEL / CentOS)

DevOps & Automation

Terraform, Helm, Kustomize

Git, CI / CD pipelines

Scripting (Python, Bash, PowerShell)

Monitoring Ecosystem

Experience with additional tools is a plus:

Dynatrace

Splunk

Sysdig

AppDynamics

SolarWinds

Moogsoft AI-Ops

Preferred Qualifications

Strong background in SRE, Observability Engineering, DevOps, or Platform Engineering.

Experience with microservices, distributed systems, and cloud-native architectures.

ITIL v3 or industry certifications in AWS / Azure / Kubernetes are a plus.

Education

Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Certifications in cloud, observability, Grafana, or Kubernetes are an advantage.
