ANSR is hiring for one of its clients.
About T-Mobile :
T-Mobile US, Inc. (NASDAQ : TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
TMUS Global Solutions :
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
About the Role :
As a Senior AIOps Engineer, you will be a key member of the CFL Platform Engineering and Operations team. You will help design and implement next-generation intelligent operations that support AI / ML platforms, LLM-based applications, and large-scale distributed systems. You’ll develop automation, observability, and remediation pipelines that enable predictive insights, reduce incident impact, and enhance the reliability of production environments.
This is a hands-on, technical role where you’ll work closely with SRE, DevOps, data, and platform teams to embed intelligent automation into core operational workflows.
What You’ll Do :
- Develop automation pipelines for anomaly detection, root cause analysis, and self-healing
- Build integrations between monitoring systems and AI / ML models for predictive alerting
- Engineer real-time observability pipelines (logs, metrics, traces) across distributed platforms
- Deploy and manage tools such as OpenTelemetry, Prometheus, Grafana, Splunk, and Datadog
- Extend telemetry coverage for LLM-based systems, APIs, and hybrid cloud environments
- Implement event-driven workflows for incident remediation and automated recovery
- Contribute to intelligent alerting standards, dashboarding, and escalation logic
- Collaborate with SRE and DevOps teams to define and implement reliability automation
- Document playbooks, remediation flows, detection rules, and AIOps patterns
- Partner with platform and data science teams on AIOps architecture and telemetry modeling
What You’ll Bring :
- Bachelor's degree in Computer Science, Engineering, or a related field
- 4-7 years of experience in SRE, DevOps, automation, or infrastructure roles
- Hands-on experience with observability tools : Prometheus, Grafana, Splunk, OpenTelemetry
- Proficient in scripting languages such as Python, Go, or Bash
- Experience building CI / CD pipelines and integrating infrastructure telemetry
- Working knowledge of Kubernetes, container operations, and cloud-native architectures
- Familiarity with Azure (preferred), AWS or GCP
- Understanding of incident response workflows, system health checks, and auto-remediation
Must Have Skills :
- Application & Microservice : Java, Spring Boot, API & Service Design
- Any CI / CD Tools : GitLab Pipeline / Test Automation / GitHub Actions / Jenkins / CircleCI
- App Platform : Docker & Containers (Kubernetes)
- Any Databases : SQL & NoSQL (Cassandra / Oracle / Snowflake / MongoDB)
- Any Messaging : Kafka, RabbitMQ
- Any Observability / Monitoring : Splunk / Grafana / OpenTelemetry / ELK Stack / Datadog / New Relic / Prometheus
- AIOps Skills : GitOps / ArgoCD / Flux
Nice To Have :
- Fleet management across EKS / AKS, Databricks integration
- Measure adoption (time-to-first-deploy)
- Mentor / coach product teams
- Multi-cloud identity federation (OIDC, SPIFFE)
- Standardized compositions, lifecycle governance