Description
- Design, implement, and maintain monitoring, logging, and alerting systems across all environments
- Lead incident response, root cause analysis, and drive post-mortem improvements
- Develop disaster recovery strategies and run regular DR drills
- Work with engineering teams to define and maintain SLAs/SLOs
- Optimize cloud infrastructure for reliability, performance, and cost
- Build automation for deployments, scaling, and recovery
- Manage infrastructure through IaC and platform tooling such as Terraform, GitLab CI/CD, and Kubernetes
- Participate in on-call rotations and respond to incidents
Required Skills & Experience
- 4+ years in SRE, DevOps, or similar roles
- Strong scripting skills: Python, Bash, Shell
- Experience with Chef (cookbooks/recipes) and Ansible (tasks/playbooks)
- Hands-on experience with AWS services (Cognito, EC2, EKS, RDS, CloudWatch, etc.)
- Strong Kubernetes administration experience in production
- Proficiency in Terraform or CloudFormation
- Excellent understanding of observability tools: Prometheus, Grafana, ELK, tracing
- Experience provisioning metrics, dashboards, queries, and alert rules
- Knowledge of PostgreSQL (including replication)
- Strong understanding of networking, load balancing, and security best practices
- Experience working with CI/CD and GitOps workflows
(ref: hirist.tech)
Skills Required
RDS, ELK, CloudFormation, Chef, PostgreSQL, Prometheus, Bash, Grafana, Shell, EC2, CloudWatch, Terraform, Ansible, Kubernetes, Python, AWS