About the Role:
We are looking for an experienced Site Reliability Engineer (SRE) to join our team and help us enhance the reliability, scalability, and performance of our cloud-based infrastructure.
As an SRE, you will work collaboratively with development and operations teams to ensure high availability, operational efficiency, and continuous improvement in our production environment.
This role is ideal for someone who has a deep understanding of DevOps principles, strong automation skills, and experience with cloud platforms and container orchestration.
Responsibilities:
- Design, implement, and maintain scalable, highly available, and resilient infrastructure on AWS, GCP, or Azure cloud platforms.
- Automate infrastructure provisioning, scaling, and management using Terraform, Ansible, or similar tools.
- Implement, monitor, and optimize CI/CD pipelines to ensure seamless and reliable release automation.
- Write scripts and automation workflows for environment setup, deployment, and configuration using tools like Jenkins, Terraform, and Ansible.
- Deploy, manage, and optimize containerized applications using Docker and Kubernetes for container orchestration.
- Set up and manage monitoring, alerting, and logging systems (e.g., Prometheus, Grafana, ELK stack) to proactively detect and mitigate issues.
- Troubleshoot and resolve issues that arise in production environments, ensuring minimal downtime and optimal performance.
- Take part in on-call rotations and respond to incidents, collaborating with cross-functional teams to ensure root causes are identified and mitigated.
- Continuously monitor system performance and optimize resources to improve uptime, latency, and cost efficiency.
- Review and improve system reliability through performance tuning and proactive capacity planning.
- Work with software engineers to improve application stability, performance, and scalability in a fast-paced development environment.
- Create and maintain detailed documentation for systems, processes, and runbooks to support knowledge sharing and best practices.
- Contribute to a culture of continuous improvement by identifying operational inefficiencies and recommending improvements.
Skills & Experience:
- 3 to 6 years of experience in Site Reliability Engineering (SRE), DevOps, or related roles.
- Proficient in cloud platforms: AWS, GCP, or Azure.
- Strong expertise with automation tools such as Terraform, Ansible, Jenkins, or equivalent.
- Solid experience with containerization and orchestration tools like Docker and Kubernetes.
- Proficient in setting up and managing monitoring and alerting systems (e.g., Prometheus, Grafana, ELK stack).
- Hands-on experience with CI/CD pipelines and release automation.
- Strong problem-solving skills, particularly in incident management and troubleshooting under pressure.
- Familiarity with scripting languages such as Python, Bash, or Go is a plus.
- Experience with infrastructure as code (IaC) practices and tools.
(ref: hirist.tech)