At Nebula Tech Solutions, we’re building a high-performing SRE team supporting mission-critical applications for our US-based enterprise clients.
We’re now looking for engineers who can go beyond operations — those who can work directly with application code to improve observability, reliability, and performance at scale.
What You’ll Do
✅ Enhance application reliability through code
- Add or modify code to improve telemetry and resilience in existing applications.
- Implement and validate retries, timeouts, and failover logic to improve system reliability.
- Contribute to and review application code changes with a focus on SRE and production-readiness.
✅ Advance observability and telemetry
- Embed new telemetry data (e.g., counters, histograms, traces, structured logs) into existing services.
- Add or upgrade OpenTelemetry and related libraries; test for compatibility and regression before rollout.
- Integrate observability enhancements with Prometheus, Grafana, ELK, and OpenTelemetry pipelines.
✅ Collaborate & support global reliability efforts
- Partner with developers to ensure metric coverage, tracing, and alerting meet production standards.
- Participate in incident response, root cause analysis (RCA), and postmortems.
- Automate recurring operational tasks using Python, Go, or similar scripting.
- Improve deployment pipelines and infrastructure using Kubernetes, Terraform, Helm, and CI/CD tools.