What will you do at Fynd
- Lead, mentor, and grow a team of 2-5 Site Reliability Engineers.
- Define, implement, and advocate SRE best practices like SLAs, SLOs, SLIs, error budgets, and chaos engineering.
- Build and maintain automated CI/CD pipelines and infrastructure using tools like Terraform, Jenkins, or GitHub Actions.
- Own the observability stack—monitoring, alerting, logging, and tracing across microservices and platforms.
- Improve reliability and scalability of services by proactively identifying bottlenecks and automating manual ops tasks.
- Drive incident response practices including on-call rotations, runbooks, and blameless postmortems.
- Ensure high availability and uptime across distributed systems hosted on AWS.
- Collaborate with cross-functional teams to ensure the architecture is cloud-native, secure, and fault-tolerant.
- Implement and optimize systems for cost-efficiency, auto-scaling, and performance.
- Contribute to open source or write technical blogs to share insights and practices with the broader tech community.
- This is a startup, so expect rapid change and plenty of opportunities to take ownership and drive new initiatives.
Some Specific Requirements
- 3+ years of experience leading SRE / DevOps / Infrastructure teams, with 5+ years overall in backend, systems, or infrastructure roles.
- Strong experience managing distributed systems and microservices at scale.
- Good understanding of Linux, networking, load balancing, and security concepts.
- Hands-on experience with AWS services like EC2, ELB, Auto Scaling, CloudFront, S3, and CloudWatch.
- Experience with container technologies and orchestration; Docker and Kubernetes are a must.
- Strong proficiency with Infrastructure-as-Code tools like Terraform, CloudFormation, or Pulumi.
- Familiarity with observability tools like Prometheus, Grafana, ELK, or Datadog.
- Programming / scripting skills in Python, Go, Bash, or similar for automation and tooling.
- Understanding of message queues and event-driven architectures using Kafka or RabbitMQ.
- Ability to manage incidents, write detailed postmortems, and improve reliability across teams and services.
- Comfortable working in a fast-paced environment with a strong culture of ownership and continuous improvement.
Skills Required
Kubernetes, Docker, Prometheus, Grafana, Terraform