Key Responsibilities:
- Develop and maintain automation scripts and tools primarily using Python to support infrastructure provisioning, monitoring, and incident response.
- Collaborate with development and operations teams to build and maintain highly available, scalable systems.
- Implement and manage monitoring, alerting, and incident management solutions using tools such as Prometheus, Grafana, ELK Stack, and Datadog.
- Participate in on-call rotations to respond to and resolve production incidents.
- Conduct root cause analysis of outages and implement preventative measures.
- Design and implement CI/CD pipelines to automate deployment processes.
- Optimize system performance, reliability, and scalability in cloud platforms such as AWS, Azure, or GCP.
- Manage container orchestration platforms like Kubernetes and container tools like Docker.
- Document operational procedures, runbooks, and best practices.
- Drive continuous improvement in system architecture and operational processes.
Qualifications and Requirements:
- Bachelor's degree in Computer Science, Engineering, or related field.
- 3+ years of experience as a Site Reliability Engineer, DevOps engineer, or related role.
- Strong programming skills in Python with experience writing production-grade automation scripts and tools.
- Experience with cloud platforms (AWS, Azure, GCP) and infrastructure-as-code tools such as Terraform, CloudFormation, or Ansible.
- Proficient with containerization and orchestration technologies like Docker and Kubernetes.
- Hands-on experience with monitoring and alerting tools (Prometheus, Grafana, ELK Stack, Datadog).
- Solid understanding of Linux system administration and networking concepts.
- Experience with CI/CD tools such as Jenkins, GitLab CI, CircleCI, or similar.
- Strong problem-solving, analytical, and communication skills.
- Familiarity with incident management and ITIL processes is a plus.
Skills Required:
Jenkins, AWS, Azure, GCP, Python, Terraform, CloudFormation