About T-Mobile
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.
About the Role
As a Senior Site Reliability Engineer, you will be a key member of the CFL Platform Engineering and Operations team, playing a pivotal role in building and scaling intelligent infrastructure to support AI / ML applications, enterprise services, and LLM-based platforms. You will contribute to the design and implementation of observability frameworks, automation-first operations, and incident response strategies to ensure reliability, performance, and scalability across production systems.
What You’ll Do
Implement and maintain observability, monitoring, and alerting systems for AI platforms and backend services
Design and support telemetry pipelines, logging infrastructure, and dashboards (Splunk, Prometheus, Grafana, OpenTelemetry)
Define and monitor SLOs, SLIs, latency, availability, and throughput metrics
Participate in on-call rotations, incident resolution, root cause analysis, and postmortems
Improve CI / CD workflows and infrastructure automation using GitLab pipelines
Optimize and scale infrastructure including Kafka, RabbitMQ, HAProxy, and distributed APIs
Collaborate with engineering teams on governance, compliance, and secure operations
Support capacity planning, cost analysis, and tuning for high-scale performance
Automate repetitive tasks and reduce toil via scripting (Python, Bash, Java)
Contribute to runbooks, knowledge base articles, and SRE best practice documentation
Mentor junior engineers and support a culture of operational excellence and reliability
What You’ll Bring
Bachelor’s degree in Computer Science, Engineering, or a related technical field
4-7 years of experience in SRE, DevOps, platform, or operations engineering roles
Strong hands-on experience in observability, monitoring, and distributed systems troubleshooting
Proficiency in scripting languages such as Python, Bash, or PowerShell
CI / CD experience with GitLab and automation across deployment pipelines
Solid understanding of SQL and NoSQL systems including Oracle DB and MongoDB
Familiarity with Kubernetes, container orchestration, and hybrid cloud (Azure, AWS, GCP, OCI)
Experience working in high-stakes, incident-driven environments
Strong working knowledge of Splunk, Grafana, Prometheus, and other observability tools
Understanding of AI / ML systems, inference APIs, and LLM infrastructure is a plus
Experience in platform compliance, security enforcement, and regulated domains (finance preferred)
Must Have Skills
Application & Microservice: Java, Spring Boot, API & Service Design
Any CI / CD Tools: GitLab Pipelines / Test Automation / GitHub Actions / Jenkins / CircleCI
App Platform: Docker & Containers (Kubernetes)
Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
Any Messaging: Kafka, RabbitMQ
Any Observability / Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
Incident / Change / Problem Management
Nice To Have
Multi-region failover (SQL Server, MongoDB, vendors)
Observability platform design (sampling, retention policies)
Own domain SLOs and error budgets
Performance engineering for latency-sensitive apps
Toil automation (SRE bots, operators)
Site Reliability Engineer • Hyderabad, India