ANSR is hiring for one of its clients.
About T-Mobile :
T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions :
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited is a subsidiary of T-Mobile US, Inc. and operates as TMUS Global Solutions.
About the Role :
As a Senior AIOps Engineer, you will be a key member of the CFL Platform Engineering and Operations team. You will help design and implement next-generation intelligent operations that support AI / ML platforms, LLM-based applications, and large-scale distributed systems. You’ll develop automation, observability, and remediation pipelines that enable predictive insights, reduce incident impact, and enhance the reliability of production environments.
This is a hands-on, technical role where you’ll work closely with SRE, DevOps, data, and platform teams to embed intelligent automation into core operational workflows.
What You’ll Do :
- Develop automation pipelines for anomaly detection, root cause analysis, and self-healing
- Build integrations between monitoring systems and AI / ML models for predictive alerting
- Engineer real-time observability pipelines (logs, metrics, traces) across distributed platforms
- Deploy and manage tools such as OpenTelemetry, Prometheus, Grafana, Splunk, and Datadog
- Extend telemetry coverage for LLM-based systems, APIs, and hybrid cloud environments
- Implement event-driven workflows for incident remediation and automated recovery
- Contribute to intelligent alerting standards, dashboarding, and escalation logic
- Collaborate with SRE and DevOps teams to define and implement reliability automation
- Document playbooks, remediation flows, detection rules, and AIOps patterns
- Partner with platform and data science teams on AIOps architecture and telemetry modeling
What You’ll Bring :
- Bachelor's degree in Computer Science, Engineering, or a related field
- 4-7 years of experience in SRE, DevOps, automation, or infrastructure roles
- Hands-on experience with observability tools : Prometheus, Grafana, Splunk, OpenTelemetry
- Proficient in scripting languages such as Python, Go, or Bash
- Experience building CI / CD pipelines and integrating infrastructure telemetry
- Working knowledge of Kubernetes, container operations, and cloud-native architectures
- Familiarity with Azure (preferred), AWS or GCP
- Understanding of incident response workflows, system health checks, and auto-remediation
Must Have Skills :
- Application & Microservice : Java, Spring Boot, API & Service Design
- Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
- App Platform : Docker & Containers (Kubernetes)
- Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
- Any Messaging : Kafka, RabbitMQ
- Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
- AIOps Skills : GitOps / ArgoCD / Flux
Nice To Have :
- Fleet management across EKS / AKS, Databricks integration
- Measure adoption (time-to-first-deploy)
- Mentor / coach product teams
- Multi-cloud identity federation (OIDC, SPIFFE)
- Standardized compositions, lifecycle governance