Responsibilities
- System Reliability : Ensuring the reliability of software systems by designing, implementing, and maintaining scalable and reliable infrastructure.
- Automation : Developing automation tools and scripts to streamline operational tasks, reduce manual intervention, and improve overall system efficiency.
- Incident Response and Resolution : Monitoring system performance and responding to incidents promptly to minimize downtime and ensure high availability.
- Capacity Planning : Analyzing system usage patterns and forecasting future capacity needs to ensure that the infrastructure can handle current and future demands.
- Performance Optimization : Identifying and addressing performance bottlenecks in software systems through optimization and tuning.
- Infrastructure as Code (IaC) : Implementing infrastructure as code practices, using tools like Terraform or Ansible, to define and manage infrastructure in a version-controlled and automated manner.
- Monitoring and Logging : Implementing and maintaining monitoring and logging solutions to gain insights into system behavior, troubleshoot issues, and proactively address potential problems.
- On-Call Support : Participating in an on-call rotation to respond to incidents outside of regular working hours and ensure 24/7 system availability.
- Security : Collaborating with security teams to implement and maintain security best practices across infrastructure and applications.
- Disaster Recovery Planning : Developing and maintaining disaster recovery plans to ensure that systems can quickly recover from major outages or failures.
- Continuous Improvement : Continuously analyzing system performance, reliability, and incidents to identify areas for improvement and implementing changes to enhance overall system resilience.
Skills
- Programming Languages : Proficiency in one or more programming or scripting languages, commonly Python, Go, or shell (Bash).
- Automation and Scripting : Strong automation skills using tools like Ansible, Puppet, Chef, or custom scripts. Knowledge of Infrastructure as Code (IaC) tools like Terraform.
- Containerization and Orchestration : Experience with containerization technologies like Docker and container orchestration platforms like Kubernetes.
- Cloud Computing : Proficiency in at least one cloud platform, such as AWS, Azure, or Google Cloud Platform, and knowledge of managing infrastructure in the cloud.
- Monitoring and Logging : Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK stack) and logging frameworks to track system performance and troubleshoot issues.
- Networking : Understanding of networking concepts and protocols, along with network troubleshooting skills.
- Security : Knowledge of security best practices, including encryption, access controls, and vulnerability management.
- Continuous Integration / Continuous Deployment (CI/CD) : Understanding and implementation of CI/CD pipelines for automated testing and deployment.
- Incident Response : Experience in incident response, troubleshooting, and resolution.
- Version Control : Proficient use of version control systems like Git.
Experience and Qualifications
- 1-2 years of experience in site reliability engineering.
- B.Tech/M.Tech in computer science, information technology, or a related field.
- Experience working for a product organization is a plus.
Role : Site Reliability Engineer
Industry Type : IT Services & Consulting
Department : Engineering - Software & QA
Employment Type : Full Time, Permanent
Role Category : DevOps
Education
UG : Any Graduate
PG : Any Postgraduate
Skills Required
Cloud Computing, Version Control, Automation