Key Responsibilities:
- Design, implement, and maintain comprehensive monitoring, logging, and alerting solutions across our production and other environments
- Lead incident response and post-mortem analyses, establishing best practices for problem resolution
- Design and implement disaster recovery strategies and ensure regular testing
- Collaborate with development teams and other stakeholders to implement SLAs for critical services
- Optimize cloud infrastructure for performance, reliability, and cost efficiency
- Develop and maintain automation for deployment, scaling, and recovery procedures
- Run and maintain our infrastructure with cookbooks using Terraform, GitLab CI/CD, and Kubernetes
- Respond to on-call incidents
Required Skills & Experience:
- 6+ years of experience in SRE, DevOps, or similar roles
- Work in a variety of languages and tools: Shell, Chef (recipes, cookbooks), Ansible (basic syntax, tasks, playbooks), and Python
- Strong experience with AWS services: Cognito, EC2, EKS, RDS, CloudWatch, etc.
- Proficient in Kubernetes administration and operations in production environments
- Experience with infrastructure as code using tools like Terraform or CloudFormation
- Strong scripting skills with Python, Bash, or similar languages
- Deep understanding of observability tools such as Prometheus, Grafana, the ELK stack, and distributed tracing systems
- Provisioning and setup of metrics and alerts in Prometheus and Grafana; provisioning and setup of logs and queries for general questions
- Experience with PostgreSQL or similar database systems, including replication strategies
- Knowledge of network protocols, load balancing, and security best practices
- Experience with CI/CD pipelines and GitOps workflows

(ref: hirist.tech)