As an SRE-2 at MoEngage, you'll be a critical member of our SRE team, responsible for the health and performance of key services and contributing directly to the evolution of our infrastructure at a scale that few engineers get to experience. This is your chance to deepen your technical expertise, take on more ownership, and mentor emerging talent while working on a platform that operates at the cutting edge.
What You'll Do to Keep Our Engines Roaring
- Be a Reliability Champion: Take ownership of the reliability, performance, and efficiency of critical services.
- Automate, Automate, Automate: Design, develop, and implement robust automation solutions to eliminate toil, streamline operations, and improve system resilience.
- Battle Incidents (and Win): Lead troubleshooting efforts for complex production incidents, perform in-depth root cause analysis, and implement sustainable preventative measures.
- Sculpt Our Infrastructure: Actively contribute to the design, implementation, and optimization of our cloud infrastructure on AWS and GCP, leveraging your expertise in technologies like Kubernetes.
- Enhance Observability: Implement and refine advanced monitoring, alerting, and logging solutions to gain deep insights into system behavior and predict potential issues.
- Collaborate for Success: Partner closely with development teams to influence architectural decisions, ensuring reliability, scalability, and security are built in from the start.
- Strengthen Our Security Posture: Implement and advocate for advanced security practices within our infrastructure and operational workflows.
- Drive Efficiency: Analyze and optimize cloud infrastructure spend, identifying and implementing cost-saving opportunities.
- Guide the Next Wave: Mentor and guide SRE-1 engineers, contributing to the growth and knowledge sharing within the team.
- Be Ready for Action: Participate in our on-call rotation, acting as a key point of escalation and resolution for critical issues.
What Makes You the Ideal Candidate
- 3-5 years of hands-on experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on production systems.
- Demonstrated expertise in Python or Go, with a proven track record of automating complex tasks.
- Strong command of AWS and/or GCP cloud platforms.
- In-depth experience with containerization and orchestration using Kubernetes (K8s, ArgoCD, Helm/Kustomize).
- Experience with infrastructure as code tools like Terraform or Ansible is highly valued.
- Solid understanding and experience with monitoring and observability stacks (VictoriaMetrics, Prometheus, Grafana, ELK stack, etc.).
- Deep knowledge of Linux/Unix systems internals and advanced networking concepts.
- Proven ability to diagnose and resolve complex issues in large-scale distributed systems.
- A strong understanding of Cloud Security and Information Security principles and best practices.
- Experience with cloud cost analysis and optimization techniques.
- Familiarity with CI/CD pipelines and GitOps methodologies.
- Experience with messaging queues and distributed systems (Celery, Kafka) is a plus.
- Excellent communication, collaboration, and problem-solving skills.
- A desire to mentor and lead by example.
Skills Required
Reliability Engineering, DevOps, Python, AWS, Kubernetes