Site Reliability Engineer

Elgebra, Chennai
3 days ago
Job description

Role Overview :

We are seeking a highly experienced and technically proficient Site Reliability Engineer (SRE) to join our team in support of our client, Qincline. The ideal candidate will have 7 or more years of dedicated experience in Site Reliability Engineering or a closely related discipline. This pivotal role requires a strong focus on ensuring the reliability, scalability, performance, and operational efficiency of large-scale, complex production systems. You'll be instrumental in bridging the gap between development and operations by applying engineering principles to operational challenges.

Key Responsibilities :

Reliability & Performance Engineering :

  • System Reliability : Design, build, and maintain robust, fault-tolerant production systems and infrastructure to meet stringent Service Level Objectives (SLOs).
  • Performance Tuning : Proactively identify and resolve performance bottlenecks across the entire application stack, from infrastructure to application code.
  • Automation : Develop and implement automation for operational tasks, infrastructure provisioning, deployment, and monitoring to eliminate manual toil.
  • Capacity Planning : Collaborate with development teams on capacity planning, forecasting demand, and ensuring the infrastructure can scale efficiently to meet future business needs.

Operations & Incident Management :

  • Monitoring & Alerting : Establish and maintain comprehensive monitoring, logging, and alerting systems to gain deep visibility into system health and performance (e.g., Prometheus, Grafana, or the ELK Stack).
  • Incident Response : Serve as a key responder during critical incidents, performing rapid triage, mitigation, and recovery.
  • Post-Mortems & RCA : Lead detailed post-mortem and Root Cause Analysis (RCA) processes for all significant incidents, ensuring that permanent fixes and preventive measures are implemented.
  • On-Call : Participate in a periodic on-call rotation to provide 24/7 support for critical production systems.

Tooling & Infrastructure :

  • CI/CD & DevOps : Enhance and manage CI/CD pipelines to facilitate fast, reliable, and automated software releases.
  • Containerization & Orchestration : Manage and optimize containerized environments using Docker and Kubernetes.
  • Infrastructure as Code (IaC) : Utilize IaC tools (e.g., Terraform, Ansible) to provision and manage infrastructure in a repeatable and documented manner.

Required Skills & Experience :

Core Experience (7+ Years) :

  • Minimum 7 years of hands-on experience in a Site Reliability Engineer, DevOps Engineer, or Production Engineer role supporting high-availability, mission-critical production environments.
  • Deep expertise in establishing and improving system monitoring, logging, alerting, and telemetry practices.
  • Demonstrated experience with formal Incident Management processes and leading thorough Root Cause Analysis (RCA).

Technical Expertise :

  • Cloud Platforms : Extensive, hands-on experience with at least one major cloud provider (e.g., AWS, Azure, or GCP). This includes managing compute, networking, storage, and managed services.
  • Scripting & Programming : Strong proficiency in scripting and programming languages, with mandatory expertise in Python and Shell scripting for automation and tooling.
  • DevOps Tooling : Proven experience with CI/CD pipeline tools (e.g., Jenkins, GitLab CI, Azure DevOps), Git, and artifact repositories.
  • Containerization : Expert-level knowledge of Docker and robust experience with orchestrating large-scale deployments using Kubernetes.
  • Operating Systems : Strong command of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, load balancing).

Desired Qualifications (Good to Have) :

  • Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
  • Familiarity with service mesh technologies (e.g., Istio, Linkerd).
  • Knowledge of database administration and performance tuning (SQL/NoSQL).
  • Certifications related to SRE, Cloud (e.g., AWS Certified DevOps Engineer), or Kubernetes (CKA, CKAD).

(ref : hirist.tech)
