Position Overview:
The Site Reliability Engineer (SRE) is responsible for ensuring the stability, scalability, performance, and reliability of production systems and services. This role bridges software development and operations, using automation, monitoring, and performance optimization to build resilient systems that can scale efficiently and recover quickly from failures.
Key Responsibilities:
- Design, build, and maintain highly reliable and scalable systems and infrastructure.
- Automate deployment, monitoring, and maintenance processes using DevOps tools and scripts.
- Implement and manage CI/CD pipelines to support continuous delivery.
- Monitor application performance, identify bottlenecks, and improve uptime and reliability.
- Develop and maintain incident response procedures, including root cause analysis and postmortems.
- Collaborate with development teams to design systems for fault tolerance, load balancing, and failover.
- Manage and optimize cloud infrastructure (AWS, Azure, GCP).
- Implement observability solutions: logging, metrics, tracing, and alerting.
- Maintain strong security and compliance standards across infrastructure.
- Participate in on-call rotations and ensure 24/7 system availability.
- Document processes, configurations, and runbooks for operational consistency.
Required Skills & Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Strong knowledge of Linux/Unix systems administration and shell scripting.
- Proficiency with automation and configuration tools (Ansible, Terraform, Chef, Puppet).
- Experience with cloud platforms: AWS, Azure, or Google Cloud.
- Familiarity with containerization and orchestration tools (Docker, Kubernetes).
- Solid understanding of CI/CD tools (Jenkins, GitLab CI, CircleCI).
- Strong experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
- Knowledge of networking fundamentals, load balancing, and DNS management.
- Proficiency in at least one programming language (Python, Go, or Bash).
- Excellent analytical, problem-solving, and communication skills.
Preferred Qualifications:
- Experience with infrastructure-as-code (IaC) and serverless architectures.
- Knowledge of reliability metrics such as SLOs, SLIs, and error budgets.
- Exposure to database administration (MySQL, PostgreSQL, MongoDB, Redis).
- Familiarity with security practices for cloud-native systems.
- Certifications such as AWS Certified DevOps Engineer, Google SRE Certification, or CKA (Certified Kubernetes Administrator).