Description
GSPANN is hiring a Senior Site Reliability Engineer (SRE) to join our team in Pune or Hyderabad. This full-time role focuses on enhancing the reliability, scalability, and observability of global cloud-based systems through automation, performance tuning, and modern DevOps practices.
Role and Responsibilities
- Manage and support production environments on cloud platforms, with a strong preference for Microsoft Azure.
- Apply expertise in observability tools such as Dynatrace, Splunk, Datadog, Grafana, and New Relic to monitor system health.
- Implement modern observability practices including end-to-end (E2E) instrumentation, telemetry, and unified dashboard creation.
- Drive organizational change by influencing senior leadership and improving SRE practices company-wide.
- Write automation scripts using Python (strongly preferred) to streamline operations and eliminate manual effort.
- Deploy cloud infrastructure using tools like Ansible, Terraform, and Azure DevOps.
- Work confidently with Continuous Integration/Continuous Deployment (CI/CD) tools such as GitLab, Jenkins, Bamboo, Travis CI, and CircleCI.
- Operate and orchestrate containerized environments using Kubernetes and Docker.
- Troubleshoot complex issues and provide reliable, scalable solutions.
- Embrace continuous learning and demonstrate a strong passion for automation and process improvement.
- Use logging stacks like ELK (Elasticsearch, Logstash, and Kibana), Loki, and Splunk to maintain visibility and traceability.
- Influence organizational adoption of Infrastructure as Code (IaC) and CI/CD methodologies.
- Define and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Lead incident response efforts and perform Root Cause Analysis (RCA) to minimize recurrence.
Skills and Experience
- Bachelor’s degree in Computer Science, Information Science, Engineering, or a related discipline.
- 6+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a focus on cloud-based production systems.
- Ensure the availability, low latency, performance, and cost efficiency of global e-commerce platforms.
- Design and maintain full-stack observability solutions, including dashboards and standardized instrumentation.
- Implement advanced monitoring and alerting systems tailored for both internal engineering teams and external stakeholders.
- Advocate for SRE best practices and promote operational excellence across teams and departments.
- Collaborate with engineering, product, and operations teams to increase reliability and accelerate delivery timelines.
- Build automation tools that support incident response, system recovery, and software delivery pipelines.
- Track and maintain error budgets, achieve defined SLOs, and guarantee high uptime for mission-critical services.
- Identify system bottlenecks and anomalies proactively, ensuring optimal performance under peak loads.
- Automate infrastructure management to reduce costs and scale efficiently during traffic surges.
- Lead strategic, cross-functional initiatives that enhance overall system architecture and reliability.