Principal Engineer, Site Reliability [T500-20295]

ANSR, Hyderabad, Telangana, IN

7 days ago

Job description

About T-Mobile

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.

About the Role

As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.

What You’ll Do

Architect observability and incident response pipelines for LLM, API, and backend systems
Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
Lead high-severity incident response, root cause analysis, and system recovery
Collaborate with AI, Platform, and Security teams to enforce operational guardrails
Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
Guide infrastructure tuning, capacity planning, and cost optimization
Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
Support AIOps, model observability, policy enforcement, and audit readiness
Mentor senior SREs and foster a high-ownership, technical excellence culture

What You’ll Bring

Bachelor's or Master’s in Computer Science, Engineering, or related field

7-12 years in SRE, infrastructure, or platform roles in distributed systems

Strong experience in incident management, AI/ML observability, and performance engineering

Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs

Proficiency in Python, Java, Bash/PowerShell, and YAML

Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes

Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB

Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI

Familiarity with AIOps, latency scoring, policy validation, and secure AI operations

Background in compliance, governance, and enterprise risk management for AI systems

Advanced debugging skills across data, infrastructure, networking, and app layers

Leadership in chaos engineering, SLO-based operations, and system resilience

Must Have Skills

Application & Microservice: Java, Spring Boot, API & Service Design

Any CI/CD Tools: GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI

App Platform: Docker & Containers (Kubernetes)

Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)

Any Messaging: Kafka, RabbitMQ

Any Observability / Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus

Incident / Change / Problem Management

Nice To Have

Compliance-aligned continuity planning (PCI, SOX)

Error-budget pacts with product / org leadership

Executive Incident / Change / Problem / Risk reporting

Observability cost vs coverage trade-offs

Org-wide reliability governance strategy
