Senior Site Reliability Engineer (SRE) – ELK Expert | Platform Engineering Practice
Location: India (Remote). Must be available to work in the EST (US/Canada) time zone.
Role Summary
Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?
We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you’ll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.
Why Join Us
- Career Growth: Work alongside industry experts on cutting-edge cloud technologies
- Competitive Compensation and Benefits: We recognize and reward top talent
- Exciting, Impactful Work: Design and build scalable, resilient cloud environments
- Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure
What You Will Do
- Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure
- Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration
- Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor
- Enhance Security and Compliance: Implement security best practices across DevOps workflows
- Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency
- Manage and Scale ELK Clusters: Operate large clusters handling 2–3+ TB/day log volumes, ensuring high availability and performance
- Optimize ELK Architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage
- Build and Tune Log Pipelines: Scale Logstash and Beats pipelines across distributed environments
- Support Kibana Observability Layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)
What You Bring
- 7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering
- 4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)
- Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)
- Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components
- Expertise in GitHub Actions, Terraform, Ansible, and Infrastructure as Code (IaC)
- Proficiency in Python, Go, or Bash for automation and scripting
- Deep understanding of Kubernetes, Docker, and cloud-native architectures
- Experience with observability tools such as Prometheus, Grafana, and Azure Monitor
- Ability to work in a fast-paced, collaborative environment and solve complex operational issues
Education
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field
Certifications (Nice to Have)
- Microsoft Azure certifications: AZ-104, AZ-400