Principal Engineer, Software - AIOps [T500-20350]

ANSR • Hyderabad, Telangana, India
Job description

ANSR is hiring for one of its clients.

About T-Mobile :

T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

TMUS Global Solutions :

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.

About the Role :

As Principal Engineer – AIOps, you will be a key member of the CFL Platform Engineering and Operations team. You will lead the development of AI-driven operational strategies, observability systems, and automated remediation pipelines that drive resiliency, self-healing, and real-time insight across cloud-native platforms. You will also architect telemetry-rich systems, influence enterprise-wide reliability frameworks, and power the next generation of operational intelligence for large-scale AI / ML platforms.

What You’ll Do :

  • Architect AI-driven pipelines for monitoring, alerting, remediation, and capacity optimization
  • Lead platform observability strategy across LLMs, inference APIs, and AI model infrastructure
  • Build and operate self-healing, event-driven automation systems using telemetry and anomaly detection
  • Define and integrate AIOps architecture using tools like OpenTelemetry, Prometheus, Splunk, Solo Gateway, and custom LLM agents
  • Partner with data science teams to turn model insights into real-time operational signals
  • Lead implementation of zero-downtime operations, production-safe chaos testing, and feedback loops
  • Scale observability for AI metrics : latency, token usage, drift, throughput, and cost efficiency
  • Drive GitOps and CI / CD practices using GitLab and cloud-native automation
  • Mentor engineers across observability, reliability, MLOps / LLMOps, and platform automation
  • Collaborate with ML, platform, and SRE teams to define SLAs, escalation strategies, and platform-wide reliability goals
  • Present engineering strategy, progress, and outcomes to senior leadership

What You’ll Bring :

  • Bachelor’s or Master’s in Computer Science, Engineering, or related field
  • 7-12 years in engineering, with 5+ years in SRE, DevOps, automation, or platform operations
  • Strong programming skills in Python or Java; experience with scripting and tool integration
  • Deep expertise in AIOps, observability, and platform automation
  • Experience architecting distributed systems and AI / ML infrastructure (LLMs, inference APIs)
  • Hands-on with Splunk, Prometheus, Grafana, OpenTelemetry, Datadog, or similar tools
  • Understanding of MLOps / LLMOps, model reliability frameworks, and golden set validation
  • Familiarity with cloud-native infrastructure (Azure preferred; AWS, GCP also accepted)
  • Experience with event-driven middleware such as Kafka, HAProxy, or RabbitMQ
  • Track record in leading large-scale reliability initiatives, mentoring engineers, and scaling platform automation
  • Strong communication and stakeholder collaboration skills
  • Knowledge of operational compliance, governance, and secure API orchestration
  • Passion for operational excellence, intelligent systems, and automation-driven transformation

Must Have Skills :

  • Application & Microservice : Java, Spring Boot, API & Service Design
  • Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
  • App Platform : Docker & Containers (Kubernetes)
  • Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
  • Any Messaging : Kafka, RabbitMQ
  • Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
  • AIOps Skills : GitOps / ArgoCD / Flux

Nice To Have :

  • Compliance-by-default baked into platform APIs
  • Critical migrations & modernization leadership
  • Golden paths as reference architectures
  • Platform strategy, SLAs, roadmap
  • Vendor management, ROI / TCO tracking