Senior Site Reliability Engineer

Nebula Tech Solutions
Mumbai, Maharashtra, India
2 days ago
Job description

At Nebula Tech Solutions, we’re building a high-performing SRE team supporting mission-critical applications for our US-based enterprise clients.

We’re now looking for engineers who can go beyond operations — those who can work directly with application code to improve observability, reliability, and performance at scale.

What You’ll Do

✅ Enhance application reliability through code

  • Add or modify code to improve telemetry and resilience in existing applications.
  • Implement and validate retries, timeouts, and failover logic to improve system reliability.
  • Contribute to and review application code changes with a focus on reliability and production readiness.

✅ Advance observability and telemetry

  • Embed new telemetry data (e.g., counters, histograms, traces, structured logs) into existing services.
  • Add or upgrade OpenTelemetry and related libraries; test for compatibility and regressions before rollout.
  • Integrate observability enhancements with Prometheus, Grafana, ELK, and OpenTelemetry pipelines.

✅ Collaborate & support global reliability efforts

  • Partner with developers to ensure metric coverage, tracing, and alerting meet production standards.
  • Participate in incident response, root cause analysis (RCA), and postmortems.
  • Automate recurring operational tasks using Python, Go, or a similar language.
  • Improve deployment pipelines and infrastructure using Kubernetes, Terraform, Helm, and CI/CD tools.

What We’re Looking For

  • 5+ years of experience in DevOps, SRE, or software development roles.
  • Strong coding proficiency in C# or Java (Python or Go is a plus).
  • Hands-on experience with Kubernetes, containerized workloads, and microservices architecture.
  • Deep understanding of telemetry and observability concepts: metrics, logs, traces, and alerting.
  • Familiarity with OpenTelemetry, Prometheus, Grafana, or similar APM tools.
  • Strong understanding of resilient design patterns (retry, circuit breaker, fail-fast, graceful degradation).
  • Experience collaborating with developers to improve code-level reliability and metric instrumentation.

Location: Remote – India
Shift: US Night Shift (Continuous)
Client: US-based Enterprise Applications
