Principal Engineer, Site Reliability [T500-20295]

TMUS Global Solutions, Hyderabad, India

Job description

About T-Mobile : T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions :

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

About the Role :

As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.

What You’ll Do :

Architect observability and incident response pipelines for LLM, API, and backend systems

Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability

Lead high-severity incident response, root cause analysis, and system recovery

Collaborate with AI, Platform, and Security teams to enforce operational guardrails

Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling

Guide infrastructure tuning, capacity planning, and cost optimization

Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry

Support AIOps, model observability, policy enforcement, and audit readiness

Mentor senior SREs and foster a high-ownership, technical excellence culture

What You’ll Bring :

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

7-12 years in SRE, infrastructure, or platform roles in distributed systems

Strong experience in incident management, AI/ML observability, and performance engineering

Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs

Proficiency in Python, Java, Bash / PowerShell, YAML

Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes

Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, MongoDB

Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI

Familiarity with AIOps, latency scoring, policy validation, and secure AI operations

Background in compliance, governance, and enterprise risk management for AI systems

Advanced debugging skills across data, infrastructure, networking, and app layers

Leadership in chaos engineering, SLO-based operations, and system resilience

Must Have Skills :

Application & Microservice : Java, Spring Boot, API & Service Design

Any CI/CD Tools : GitLab Pipelines / Test Automation / GitHub Actions / Jenkins / CircleCI

App Platform : Docker & Containers (Kubernetes)

Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)

Any Messaging : Kafka, RabbitMQ

Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus

Incident / Change / Problem Management

Nice To Have :

Compliance-aligned continuity planning (PCI, SOX)

Error-budget pacts with product / org leadership

Executive Incident / Change / Problem / Risk reporting

Observability cost vs coverage trade-offs

Org-wide reliability governance strategy
