We are seeking a dedicated Reliability Engineer to ensure the optimal performance, availability, and reliability of our systems and infrastructure.
In this role, you will focus on identifying potential issues before they impact users, improving system robustness, and driving continuous improvement in operational practices.
Key Responsibilities:
- Monitor, analyze, and improve the reliability, availability, and performance of systems and services.
- Develop and implement strategies for fault tolerance, disaster recovery, and incident management.
- Collaborate with development, operations, and support teams to identify and mitigate risks.
- Conduct root cause analysis for system failures and develop corrective actions.
- Design and implement automated monitoring, alerting, and reporting solutions.
- Participate in capacity planning and infrastructure scaling to meet growing demand.
- Develop and maintain documentation for reliability standards, processes, and best practices.
- Support continuous improvement initiatives to enhance system resilience.
- Drive adoption of best practices in change management, deployment, and incident response.
- Evaluate new technologies and tools to improve system reliability and performance.
Required Qualifications:
- Bachelor's degree in Engineering, Computer Science, or a related field.
- 3+ years of experience in reliability engineering, site reliability engineering (SRE), or a related role.
- Strong understanding of system architecture, networking, and distributed systems.
- Experience with monitoring and alerting tools such as Prometheus, Grafana, Nagios, or similar.
- Proficiency in scripting languages such as Python, Bash, or PowerShell.
- Knowledge of cloud platforms (AWS, Azure, GCP) and of containers and container orchestration (Docker, Kubernetes).
- Experience with incident management and root cause analysis methodologies.
- Familiarity with automation tools and infrastructure as code (Terraform, Ansible).
- Strong analytical, problem-solving, and communication skills.
(ref: hirist.tech)