Sr Engineer, Site Reliability [T500-20286]

ANSR
Hyderabad, Telangana, India
7 days ago
Job description

About T-Mobile

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.

About the Role

As a Senior Site Reliability Engineer, you will be a key member of the CFL Platform Engineering and Operations team, where you will play a pivotal role in building and scaling intelligent infrastructure to support AI/ML applications, enterprise services, and LLM-based platforms. You will contribute to the design and implementation of observability frameworks, automation-first operations, and incident response strategies to ensure reliability, performance, and scalability across production systems.

What You’ll Do

  • Implement and maintain observability, monitoring, and alerting systems for AI platforms and backend services
  • Design and support telemetry pipelines, logging infrastructure, and dashboards (Splunk, Prometheus, Grafana, OpenTelemetry)
  • Define and monitor SLOs, SLIs, latency, availability, and throughput metrics
  • Participate in on-call rotations, incident resolution, root cause analysis, and postmortems
  • Improve CI/CD workflows and infrastructure automation using GitLab pipelines
  • Optimize and scale infrastructure including Kafka, RabbitMQ, HAProxy, and distributed APIs
  • Collaborate with engineering teams on governance, compliance, and secure operations
  • Support capacity planning, cost analysis, and tuning for high-scale performance
  • Automate repetitive tasks and reduce toil via scripting (Python, Bash, Java)
  • Contribute to runbooks, knowledge base articles, and SRE best practice documentation
  • Mentor junior engineers and support a culture of operational excellence and reliability

What You’ll Bring

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field
  • 4-7 years of experience in SRE, DevOps, platform, or operations engineering roles
  • Strong hands-on experience in observability, monitoring, and distributed systems troubleshooting
  • Proficiency in scripting languages such as Python, Bash, or PowerShell
  • CI/CD experience with GitLab and automation across deployment pipelines
  • Solid understanding of SQL and NoSQL systems including Oracle DB and MongoDB
  • Familiarity with Kubernetes, container orchestration, and hybrid cloud (Azure, AWS, GCP, OCI)
  • Experience working in high-stakes, incident-driven environments
  • Strong working knowledge of Splunk, Grafana, Prometheus, and other observability tools
  • Understanding of AI/ML systems, inference APIs, and LLM infrastructure is a plus
  • Experience in platform compliance, security enforcement, and regulated domains (finance preferred)

Must Have Skills

  • Application & Microservices: Java, Spring Boot, API & Service Design
  • Any CI/CD Tools: GitLab Pipelines / Test Automation / GitHub Actions / Jenkins / CircleCI
  • App Platform: Docker & Containers (Kubernetes)
  • Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
  • Any Messaging: Kafka, RabbitMQ
  • Any Observability / Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
  • Incident / Change / Problem Management

Nice To Have

  • Multi-region failover (SQL Server, MongoDB, vendors)
  • Observability platform design (sampling, retention policies)
  • Own domain SLOs and error budgets
  • Performance engineering for latency-sensitive apps
  • Toil automation (SRE bots, operators)