Senior Site Reliability Engineer

Confidential • Delhi, Mumbai, Kolkata
16 days ago
Job description
  • Build products with MVRs and reliability standards, ensuring system resilience and scalability.
  • Set up and operate observability tools across multiple cloud providers, incorporating AI-powered anomaly detection to enhance monitoring.
  • Assist development teams in defining SLO/SLI dashboards and alerts, optimizing alerting signals with ML-based noise reduction techniques.
  • Use Go, Python, or Terraform to automate operational tasks and build self-healing mechanisms.
  • Manage and administer Grafana, Prometheus, Loki, and other observability tools, integrating predictive analytics where beneficial.
  • Troubleshoot and support production environments, using AI-assisted diagnostics where applicable for faster root cause identification.
  • Automate incident response workflows, leveraging AIOps to reduce manual toil and improve MTTR.
What You'll Need to Be Successful

    • Minimum of 5 years' experience in a SaaS environment.
    • Bachelor's degree or equivalent experience.
    • Ability to participate in an on-call rotation.
    • Strong understanding of networking (OSI model, TCP/IP, DNS), particularly in cloud environments.
    • Experience with Linux administration, security hardening, and performance tuning.
    • Passion for troubleshooting distributed systems and software failures.
    • Deep understanding of observability principles, including log analysis, tracing, and metrics correlation.
    • Strong background in infrastructure as code (Terraform, Pulumi) and container orchestration (Kubernetes, ECS, Nomad).
    • Interest in AI-powered automation, including AIOps tools, ML-based alert tuning, and predictive maintenance.
    • Experience with observability tools such as Prometheus, Grafana, or OpenTelemetry, combined with ML-based anomaly detection, is a plus.
    • Excellent technical writing skills for documenting architectures, processes, and automation workflows.
Skills Required

      Terraform, SaaS, Kubernetes, Incident Response
