Key Responsibilities:
- Lead and mentor a team of SREs/DevOps Engineers, fostering a culture of ownership, reliability, and continuous improvement.
- Own the availability, scalability, and performance of production systems and services.
- Design and manage distributed systems and microservices architectures at scale.
- Develop and implement incident response strategies, conduct root cause analysis, and create actionable postmortems.
- Drive improvements in infrastructure automation, CI/CD pipelines, and deployment strategies.
- Collaborate with cross-functional teams including engineering, product, and QA to embed SRE best practices.
- Implement observability tools (e.g., Prometheus, Grafana, ELK, Datadog) to monitor system performance and proactively detect issues.
- Manage and optimize cloud infrastructure on AWS, including services such as EC2, ELB, AutoScaling, S3, CloudFront, and CloudWatch.
- Utilize Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi for provisioning and maintaining infrastructure.
- Apply strong Linux, networking, load balancing, and security principles to ensure platform resilience.
- Leverage Docker and Kubernetes for container orchestration and scalable deployments.
- Build internal tools and automation using Python, Go, or Bash scripting.
- Support event-driven architectures leveraging Kafka or RabbitMQ for high-throughput, real-time systems.
- Proactively contribute to reliability-focused architecture and design.

Skills & Experience:
- 6-10 years of overall experience in backend engineering, infrastructure, DevOps, or SRE roles.
- Minimum 3 years of experience leading SRE, DevOps, or Infrastructure teams.
- Proven track record managing distributed systems and microservices at scale.
- Deep understanding of Linux systems, networking fundamentals, load balancing, and infrastructure security.
- Strong hands-on experience with AWS services: EC2, ELB, AutoScaling, CloudFront, S3, and CloudWatch.
- Expert-level knowledge of Docker and Kubernetes in production environments.
- Proficient with Infrastructure-as-Code tools: Terraform, CloudFormation, or Pulumi.
- Hands-on experience with monitoring and observability tools: Prometheus, Grafana, ELK Stack, or Datadog.
- Strong scripting or programming skills in Python, Go, Bash, or similar languages.
- Familiarity with Kafka or RabbitMQ for event-driven and messaging architectures.
- Excellent incident management skills, including triage, RCA, and communication.
- Ability to thrive in fast-paced environments and adapt to changing ...

Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Experience in startup or high-growth environments.
- Contributions to open-source DevOps or SRE tools are a plus.
- Certifications in AWS, Kubernetes, or other cloud-native technologies are advantageous.

(ref: hirist.tech)