- Build products with MVRs and reliability standards, ensuring system resilience and scalability.
- Set up and operate observability tools across multiple cloud providers, incorporating AI-powered anomaly detection to enhance monitoring.
- Assist development teams in defining SLO/SLI dashboards and alerts, optimizing alerting signals with ML-based noise reduction techniques.
- Use Go, Python, or Terraform to automate operational tasks and build self-healing mechanisms.
- Manage and administer Grafana, Prometheus, Loki, and other observability tools, integrating predictive analytics where beneficial.
- Troubleshoot and support production environments, using AI-assisted diagnostics where applicable for faster root cause identification.
- Automate incident response workflows, leveraging AIOps to reduce manual toil and improve MTTR.
What You'll Need to be Successful
- Minimum of 5 years' experience in a SaaS environment.
- Bachelor's degree or equivalent experience.
- Ability to participate in an on-call rotation.
- Strong understanding of networking (OSI model, TCP/IP, DNS), particularly in cloud environments.
- Experience with Linux administration, security hardening, and performance tuning.
- Passion for troubleshooting distributed systems and software failures.
- Deep understanding of observability principles, including log analysis, tracing, and metrics correlation.
- Strong background in infrastructure as code (Terraform, Pulumi) and container orchestration (Kubernetes, ECS, Nomad).
- Interest in AI-powered automation, including AIOps tools, ML-based alert tuning, and predictive maintenance.
- Experience with observability tools such as Prometheus, Grafana, or OpenTelemetry, combined with ML-based anomaly detection, is a plus.
- Excellent technical writing skills for documenting architectures, processes, and automation workflows.
Skills Required
Terraform, SaaS, Kubernetes, Incident Response