Senior Site Reliability Engineer (SRE) – ELK Expert | Platform Engineering Practice
Location: India (Remote)
Must be available to work in the EST (US/Canada) time zone.
Role Summary
Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?
We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you’ll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.
Why Join Us
Career Growth: Work alongside industry experts on cutting-edge cloud technologies
Competitive Compensation and Benefits: We recognize and reward top talent
Exciting, Impactful Work: Design and build scalable, resilient cloud environments
Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure
What You Will Do
Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure
Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration
Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor
Enhance Security and Compliance: Implement security best practices across DevOps workflows
Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency
Manage and Scale Large ELK Clusters: Handle log volumes of 2–3+ TB/day while ensuring high availability and performance
Optimize ELK Architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage
Build and Tune Log Pipelines: Scale Logstash and Beats pipelines across distributed environments
Support Kibana Observability Layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)
What You Bring
7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering
4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)
Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)
Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components
Expertise in GitHub Actions, Terraform, Ansible, and Infrastructure as Code (IaC)
Proficiency in Python, Go, or Bash for automation and scripting
Deep understanding of Kubernetes, Docker, and cloud-native architectures
Experience with observability tools such as Prometheus, Grafana, and Azure Monitor
Ability to work in a fast-paced, collaborative environment and solve complex operational issues
Education
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field
Certifications (Nice to Have)
Microsoft Azure certifications: AZ-104, AZ-400