About T-Mobile
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.
About the Role
As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.
What You’ll Do
- Architect observability and incident response pipelines for LLM, API, and backend systems
- Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
- Lead high-severity incident response, root cause analysis, and system recovery
- Collaborate with AI, Platform, and Security teams to enforce operational guardrails
- Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
- Guide infrastructure tuning, capacity planning, and cost optimization
- Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
- Support AIOps, model observability, policy enforcement, and audit readiness
- Mentor senior SREs and foster a high-ownership, technical excellence culture
What You’ll Bring
- Bachelor’s or Master’s in Computer Science, Engineering, or related field
- 7-12 years in SRE, infrastructure, or platform roles in distributed systems
- Strong experience in incident management, AI / ML observability, and performance engineering
- Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
- Proficiency in Python, Java, Bash / PowerShell, YAML
- Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
- Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, MongoDB
- Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
- Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
- Background in compliance, governance, and enterprise risk management for AI systems
- Advanced debugging skills across data, infrastructure, networking, and app layers
- Leadership in chaos engineering, SLO-based operations, and system resilience
Must Have Skills
- Application & Microservice: Java, Spring Boot, API & Service Design
- Any CI/CD Tools: GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
- App Platform: Docker & Containers (Kubernetes)
- Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
- Any Messaging: Kafka, RabbitMQ
- Any Observability / Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
- Incident / Change / Problem Management
Nice To Have
- Compliance-aligned continuity planning (PCI, SOX)
- Error-budget pacts with product / org leadership
- Executive Incident / Change / Problem / risk reporting
- Observability cost vs coverage trade-offs
- Org-wide reliability governance strategy