Job Description: Sr. Site Reliability Engineer (SRE)
We are seeking an experienced and results-driven Sr. Site Reliability Engineer (SRE) to join our team. The SRE will be responsible for ensuring the reliability, scalability, performance, and observability of our infrastructure and services.
This role requires strong expertise in cloud computing, Kubernetes, automation, monitoring, and incident management. The selected candidate will work closely with cross-functional teams to design and implement systems that are resilient, cost-effective, and efficient.
The ideal professional will have hands-on experience in designing and maintaining large-scale distributed systems and a proven track record in cloud-native operations. This position demands a proactive approach to automation, observability, disaster recovery, and incident response.
Key Responsibilities:
- Reliability & Observability: Design, implement, and manage monitoring, logging, and alerting systems to improve visibility across environments. Use Prometheus, Grafana, the ELK Stack, and distributed tracing tools to ensure system health.
- Incident Management: Lead incident response efforts, participate in on-call rotations, resolve critical issues under pressure, and perform post-mortem analysis to improve future resilience.
- Disaster Recovery & Scalability: Define and implement disaster recovery plans, conduct regular failover drills, and ensure infrastructure is designed for scalability and high availability.
- Cloud Infrastructure Management: Operate and optimize environments hosted on AWS services including EC2, EKS, RDS, Cognito, and CloudWatch. Focus on cost-efficiency, reliability, and security.
- Automation & Infrastructure as Code: Develop and maintain automation frameworks using Terraform or CloudFormation. Implement CI/CD and GitOps workflows with GitLab CI/CD to streamline deployments.
- Kubernetes Administration: Manage production-grade Kubernetes clusters, perform upgrades, troubleshoot bottlenecks, and enforce best practices for high availability.
- Database Operations: Administer PostgreSQL and similar databases, design replication strategies, ensure backup and recovery mechanisms, and monitor performance.
- Networking & Security: Apply knowledge of networking protocols, load balancing, and security principles to protect and optimize infrastructure.
- Cross-team Collaboration: Partner with development and QA teams to establish SLAs and SLOs for critical services, ensuring alignment of operational goals with business requirements.
Required Skills & Experience:
- At least 4 years of experience as an SRE, DevOps Engineer, or equivalent role.
- Strong expertise with AWS services such as EC2, EKS, RDS, Cognito, and CloudWatch.
- Proficiency in Kubernetes administration in production environments.
- Hands-on experience with Infrastructure as Code.
- Strong scripting and automation abilities using Python and Bash.
- Proficiency with observability stacks: Prometheus, Grafana, ELK.
- Experience in building and maintaining CI/CD pipelines with GitLab CI/CD and GitOps workflows.
- Solid knowledge of PostgreSQL administration and replication.
- Understanding of networking fundamentals, load balancing, and security best practices.
- Ability to manage incident response and prioritize multiple issues effectively.
Preferred Qualifications:
- Experience with configuration management tools such as Chef or Ansible.
- Familiarity with monitoring and observability solutions such as Splunk, Datadog, or Dynatrace.
- Exposure to distributed tracing systems for performance troubleshooting.
- Certifications including AWS Certified Solutions Architect, AWS Certified DevOps Engineer, or Certified Kubernetes Administrator (CKA).
(ref: hirist.tech)