About T-Mobile :
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions :
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
About the Role :
As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.
What You’ll Do :
Architect observability and incident response pipelines for LLM, API, and backend systems
Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
Lead high-severity incident response, root cause analysis, and system recovery
Collaborate with AI, Platform, and Security teams to enforce operational guardrails
Implement automation-first strategies using GitLab CI / CD, Terraform, and deployment tooling
Guide infrastructure tuning, capacity planning, and cost optimization
Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
Support AIOps, model observability, policy enforcement, and audit readiness
Mentor senior SREs and foster a high-ownership, technical excellence culture
What You’ll Bring :
Bachelor’s or Master’s in Computer Science, Engineering, or related field
7-12 years in SRE, infrastructure, or platform roles in distributed systems
Strong experience in incident management, AI / ML observability, and performance engineering
Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
Proficiency in Python, Java, Bash / PowerShell, YAML
Deep knowledge of CI / CD workflows, GitLab pipelines, and SDLC processes
Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, MongoDB
Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
Background in compliance, governance, and enterprise risk management for AI systems
Advanced debugging skills across data, infrastructure, networking, and app layers
Leadership in chaos engineering, SLO-based operations, and system resilience
Must Have Skills :
Application & Microservice : Java, Spring Boot, API & Service Design
Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
App Platform : Docker & Containers (Kubernetes)
Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
Any Messaging : Kafka, RabbitMQ
Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
Incident / Change / Problem Management
Nice To Have :
Compliance-aligned continuity planning (PCI, SOX)
Error-budget pacts with product / org leadership
Executive incident / change / problem / risk reporting
Observability cost vs coverage trade-offs
Org-wide reliability governance strategy
Site Reliability Engineer • Hyderabad, India