Role Overview :
We are seeking a highly experienced and technically proficient Site Reliability Engineer (SRE) to join our team in support of our client, Qincline. The ideal candidate will have 7 or more years of dedicated experience in Site Reliability Engineering or a closely related discipline. This pivotal role requires a strong focus on ensuring the reliability, scalability, performance, and operational efficiency of large-scale, complex production systems. You'll be instrumental in bridging the gap between development and operations by applying engineering principles to operational challenges.
Key Responsibilities :
Reliability & Performance Engineering :
- System Reliability : Design, build, and maintain robust, fault-tolerant production systems and infrastructure to meet stringent Service Level Objectives (SLOs).
- Performance Tuning : Proactively identify and resolve performance bottlenecks across the entire application stack, from infrastructure to application code.
- Automation : Develop and implement automation for operational tasks, infrastructure provisioning, deployment, and monitoring to eliminate manual toil.
- Capacity Planning : Collaborate with development teams on capacity planning, forecasting demand, and ensuring the infrastructure can scale efficiently to meet future business needs.
Operations & Incident Management :
- Monitoring & Alerting : Establish and maintain comprehensive monitoring, logging, and alerting systems to gain deep visibility into system health and performance (e.g., Prometheus, Grafana, or the ELK Stack).
- Incident Response : Serve as a key responder during critical incidents, performing rapid triage, mitigation, and recovery.
- Post-Mortems & RCA : Lead detailed Post-Mortem and Root Cause Analysis (RCA) processes for all significant incidents, ensuring that permanent fixes are implemented to prevent recurrence.
- On-Call : Participate in a periodic on-call rotation to provide 24/7 support for critical production systems.
Tooling & Infrastructure :
- CI/CD & DevOps : Enhance and manage CI/CD pipelines to facilitate fast, reliable, and automated software releases.
- Containerization & Orchestration : Manage and optimize containerized environments using Docker and Kubernetes.
- Infrastructure as Code (IaC) : Utilize IaC tools (e.g., Terraform, Ansible) to provision and manage infrastructure in a repeatable and documented manner.
Required Skills & Experience :
Core Experience (7+ Years) :
- Minimum 7 years of hands-on experience in a Site Reliability Engineer, DevOps Engineer, or Production Engineer role supporting high-availability, mission-critical production environments.
- Deep expertise in establishing and improving system monitoring, logging, alerting, and telemetry practices.
- Demonstrated experience with formal Incident Management processes and leading thorough Root Cause Analysis (RCA).
Technical Expertise :
- Cloud Platforms : Extensive, hands-on experience with at least one major cloud provider (e.g., AWS, Azure, or GCP). This includes managing compute, networking, storage, and managed services.
- Scripting & Programming : Strong proficiency in scripting and programming languages, with mandatory expertise in Python and Shell scripting for automation and tooling.
- DevOps Tooling : Proven experience with CI/CD pipeline tools (e.g., Jenkins, GitLab CI, Azure DevOps), Git, and artifact repositories.
- Containerization : Expert-level knowledge of Docker and robust experience with orchestrating large-scale deployments using Kubernetes.
- Operating Systems : Strong command of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, Load Balancing).
Desired Qualifications (Good to Have) :
- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).
- Familiarity with service mesh technologies (e.g., Istio, Linkerd).
- Knowledge of database administration and performance tuning (SQL/NoSQL).
- Certifications related to SRE, Cloud (e.g., AWS Certified DevOps Engineer), or Kubernetes (CKA, CKAD).
(ref : hirist.tech)