Description
GSPANN is hiring a Site Reliability Engineer (SRE) for its Pune or Hyderabad location. This full-time role focuses on enhancing the reliability of global eCommerce platforms through automation, observability, and cloud-native tools like Azure, Kubernetes, and Terraform.
Role and Responsibilities
- Use monitoring tools such as Dynatrace, Splunk, Datadog, Grafana, or New Relic in hands-on scenarios.
- Demonstrate strong knowledge of observability tools, trends, and technologies.
- Identify gaps in SRE practices and implement scalable, effective solutions.
- Support cloud-based production environments, with a preference for Microsoft Azure.
- Write automation scripts proficiently, ideally using Python.
- Utilize cloud deployment tools like Ansible, Terraform, and Azure DevOps effectively.
- Work comfortably in containerized environments using Kubernetes and Docker.
- Apply configuration management tools such as Chef, Ansible, or AWS CodeDeploy.
- Troubleshoot complex issues independently and provide quick resolutions.
- Use and configure observability dashboards and manage end-to-end (E2E) monitoring requirements.
- Maintain expertise in cloud and automation tools (e.g., Azure, Python).
- Leverage Continuous Integration/Continuous Deployment (CI/CD) and Infrastructure as Code (IaC) tools like GitLab, Jenkins, Ansible, Terraform, and Azure DevOps.
- Exhibit soft skills including ownership, effective troubleshooting, and strong collaboration.
- Define and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Participate in incident response efforts and conduct Root Cause Analysis (RCA) after outages.
Skills and Experience
- Bachelor's degree in Computer Science, Information Science, Engineering, or a related field.
- 3–8 years of experience in a Site Reliability Engineering (SRE) or DevOps role.
- Monitor global e-commerce platforms to ensure optimal availability, performance, and efficiency while managing emergency responses.
- Promote observability best practices and drive operational excellence across systems.
- Build and maintain comprehensive observability dashboards with end-to-end monitoring.
- Design solutions and tools that enhance visibility for both internal teams and external stakeholders.
- Establish instrumentation standards and develop repeatable implementation patterns for engineering teams.
- Work closely with cross-functional teams to embed high-reliability practices into system design and operations.
- Apply SRE principles to improve overall system performance and reduce incidents.
- Automate incident response processes and coordinate outage preparedness across teams.
- Maintain error budgets, meet SLOs, and ensure consistent uptime of mission-critical services.