No longer accepting applications
▷ Immediate Start : Principal Engineer, Software - AIOps [T500-20350]

TMUS Global Solutions • Hyderabad, Telangana, India
22 days ago
Job description

About T-Mobile :

T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions :

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

About the Role :

As Principal Engineer – AIOps, you will be a key member of the CFL Platform Engineering and Operations team. You will lead the development of AI-driven operational strategies, observability systems, and automated remediation pipelines that drive resiliency, self-healing, and real-time insight across cloud-native platforms. You will architect telemetry-rich systems, influence enterprise-wide reliability frameworks, and power the next generation of operational intelligence for large-scale AI / ML platforms.

What You’ll Do :

  • Architect AI-driven pipelines for monitoring, alerting, remediation, and capacity optimization
  • Lead platform observability strategy across LLMs, inference APIs, and AI model infrastructure
  • Build and operate self-healing, event-driven automation systems using telemetry and anomaly detection
  • Define and integrate AIOps architecture using tools like OpenTelemetry, Prometheus, Splunk, Solo Gateway, and custom LLM agents
  • Partner with data science teams to turn model insights into real-time operational signals
  • Lead implementation of zero-downtime operations, production-safe chaos testing, and feedback loops
  • Scale observability for AI metrics : latency, token usage, drift, throughput, and cost efficiency
  • Drive GitOps and CI / CD practices using GitLab and cloud-native automation
  • Mentor engineers across observability, reliability, MLOps / LLMOps, and platform automation
  • Collaborate with ML, platform, and SRE teams to define SLAs, escalation strategies, and platform-wide reliability goals
  • Present engineering strategy, progress, and outcomes to senior leadership

What You’ll Bring :

  • Bachelor’s or Master’s in Computer Science, Engineering, or related field
  • 7-12 years in engineering, with 5+ years in SRE, DevOps, automation, or platform operations
  • Strong programming skills in Python or Java; experience with scripting and tool integration
  • Deep expertise in AIOps, observability, and platform automation
  • Experience architecting distributed systems and AI / ML infrastructure (LLMs, inference APIs)
  • Hands-on with Splunk, Prometheus, Grafana, OpenTelemetry, Datadog, or similar tools
  • Understanding of MLOps / LLMOps, model reliability frameworks, and golden set validation
  • Familiarity with cloud-native infrastructure (Azure preferred; AWS, GCP also accepted)
  • Experience with event-driven middleware such as Kafka, HAProxy, or RabbitMQ
  • Track record in leading large-scale reliability initiatives, mentoring engineers, and scaling platform automation
  • Strong communication and stakeholder collaboration skills
  • Knowledge of operational compliance, governance, and secure API orchestration
  • Passion for operational excellence, intelligent systems, and automation-driven transformation

Must Have Skills :

  • Application & Microservice : Java, Spring Boot, API & Service Design
  • Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
  • App Platform : Docker & Containers (Kubernetes)
  • Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
  • Any Messaging : Kafka, RabbitMQ
  • Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
  • AIOps Skills : GitOps / ArgoCD / Flux

Nice To Have :

  • Compliance-by-default baked into platform APIs
  • Critical migrations & modernization leadership
  • Golden paths as reference architectures
  • Platform strategy, SLAs, roadmap
  • Vendor mgmt, ROI / TCO tracking