Principal Engineer, AI Platform Reliability

TMUS Global Solutions, Hyderabad, India
25 days ago
Job description

About T-Mobile:

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions:

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

About the Role:

As a Principal SRE, you will be a key member of the CFL Platform Engineering and Operations team. You will lead reliability engineering for AI-powered platforms supporting LLM applications, AI gateways, and enterprise-scale services across finance, credit, collections, and document systems. You will design and implement observability and incident response frameworks, scale high-performance infrastructure, and champion SRE best practices to support secure, automated, and resilient systems.

What You’ll Do:

  • Architect observability and incident response pipelines for LLM, API, and backend systems
  • Define SLAs, SLIs, alerts, and dashboards for latency, throughput, and availability
  • Lead high-severity incident response, root cause analysis, and system recovery
  • Collaborate with AI, Platform, and Security teams to enforce operational guardrails
  • Implement automation-first strategies using GitLab CI/CD, Terraform, and deployment tooling
  • Guide infrastructure tuning, capacity planning, and cost optimization
  • Drive monitoring across hybrid clouds using Prometheus, Grafana, Splunk, and OpenTelemetry
  • Support AIOps, model observability, policy enforcement, and audit readiness
  • Mentor senior SREs and foster a high-ownership, technical excellence culture

What You’ll Bring:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field
  • 7 to 12 years of experience in SRE, infrastructure, or platform roles in distributed systems
  • Strong experience in incident management, AI/ML observability, and performance engineering
  • Hands-on expertise with OpenAI APIs, inference systems, AI gateways, and secure APIs
  • Proficiency in Python, Java, Bash/PowerShell, and YAML
  • Deep knowledge of CI/CD workflows, GitLab pipelines, and SDLC processes
  • Experience with Kafka, HAProxy, RabbitMQ, Oracle DB, and MongoDB
  • Proven success in scaling cloud-native platforms on Azure, AWS, GCP, or OCI
  • Familiarity with AIOps, latency scoring, policy validation, and secure AI operations
  • Background in compliance, governance, and enterprise risk management for AI systems
  • Advanced debugging skills across data, infrastructure, networking, and app layers
  • Leadership in chaos engineering, SLO-based operations, and system resilience

Must-Have Skills:

  • Application & Microservice: Java, Spring Boot, API & Service Design
  • Any CI/CD Tools: GitLab Pipelines / Test Automation / GitHub Actions / Jenkins / CircleCI
  • App Platform: Docker & Containers (Kubernetes)
  • Any Databases: SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
  • Any Messaging: Kafka, RabbitMQ
  • Any Observability / Monitoring: Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
  • Incident / Change / Problem Management

Nice To Have:

  • Compliance-aligned continuity planning (PCI, SOX)
  • Error-budget pacts with product / org leadership
  • Executive incident / change / problem / risk reporting
  • Observability cost vs coverage trade-offs
  • Org-wide reliability governance strategy