ANSR is hiring for one of its clients.
About T-Mobile :
T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions :
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.
About the Role :
As Principal Engineer – AIOps, you will be a key member of the CFL Platform Engineering and Operations team. You will lead the development of AI-driven operational strategies, observability systems, and automated remediation pipelines that drive resiliency, self-healing, and real-time insight across cloud-native platforms. You will architect telemetry-rich systems, influence enterprise-wide reliability frameworks, and power the next generation of operational intelligence for large-scale AI / ML platforms.
What You’ll Do :
- Architect AI-driven pipelines for monitoring, alerting, remediation, and capacity optimization
- Lead platform observability strategy across LLMs, inference APIs, and AI model infrastructure
- Build and operate self-healing, event-driven automation systems using telemetry and anomaly detection
- Define and integrate AIOps architecture using tools like OpenTelemetry, Prometheus, Splunk, Solo Gateway, and custom LLM agents
- Partner with data science teams to turn model insights into real-time operational signals
- Lead implementation of zero-downtime operations, production-safe chaos testing, and feedback loops
- Scale observability for AI metrics : latency, token usage, drift, throughput, and cost efficiency
- Drive GitOps and CI / CD practices using GitLab and cloud-native automation
- Mentor engineers across observability, reliability, MLOps / LLMOps, and platform automation
- Collaborate with ML, platform, and SRE teams to define SLAs, escalation strategies, and platform-wide reliability goals
- Present engineering strategy, progress, and outcomes to senior leadership
What You’ll Bring :
- Bachelor’s or Master’s in Computer Science, Engineering, or related field
- 7-12 years in engineering, with 5+ years in SRE, DevOps, automation, or platform operations
- Strong programming skills in Python or Java; experience with scripting and tool integration
- Deep expertise in AIOps, observability, and platform automation
- Experience architecting distributed systems and AI / ML infrastructure (LLMs, inference APIs)
- Hands-on with Splunk, Prometheus, Grafana, OpenTelemetry, Datadog, or similar tools
- Understanding of MLOps / LLMOps, model reliability frameworks, and golden set validation
- Familiarity with cloud-native infrastructure (Azure preferred; AWS, GCP also accepted)
- Experience with event-driven middleware such as Kafka, HAProxy, or RabbitMQ
- Track record in leading large-scale reliability initiatives, mentoring engineers, and scaling platform automation
- Strong communication and stakeholder collaboration skills
- Knowledge of operational compliance, governance, and secure API orchestration
- Passion for operational excellence, intelligent systems, and automation-driven transformation
Must Have Skills :
- Application & Microservice : Java, Spring Boot, API & Service Design
- Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
- App Platform : Docker & Containers (Kubernetes)
- Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
- Any Messaging : Kafka, RabbitMQ
- Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
- AIOps Skills : GitOps / ArgoCD / Flux
Nice To Have :
- Compliance-by-default baked into platform APIs
- Critical migrations & modernization leadership
- Golden paths as reference architectures
- Platform strategy, SLAs, roadmap
- Vendor management, ROI / TCO tracking