Job description

Overview
We are seeking a passionate, hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. In this role you will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency. You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve the detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
• Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
• Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
• Develop AI agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
• Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
• Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
• Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
• Build asynchronous Python APIs and services for model inference and operational integration.
• Enhance observability intelligence with AI-powered capabilities such as root-cause acceleration, chatbot/search enablement, and automated insights.
• Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
• Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications
Core Technical Skills
• Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
• Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
• Expertise in developing and deploying ML models in production (batch and streaming).
• Strong understanding of statistics, time-series modeling, and anomaly detection.
Observability & Telemetry
• Experience with OpenTelemetry for logs, metrics, traces, and spans.
• Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
• Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, and NetBrain.
Cloud, Data & Platform
• Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake/Openflow, and Snowflake AI Observability tooling.
• Experience building Snowflake data pipelines (streams, tasks, UDFs); experience with Cortex features is a plus.
• Strong understanding of distributed systems and microservices telemetry requirements.
Automation & Engineering Quality
• Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.
• Ability to build asynchronous Python APIs or services for model inference and operational integration.
________________________________________
Preferred Qualifications
• Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.
• Experience building self-healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.
• Experience with Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.
• Exposure to AI-enabled alerting, RCA automation, and operational self-healing concepts.
• Experience with large-scale operational telemetry and multi-cloud ecosystems.
Soft Skills
• Strong analytical thinking and problem-solving skills.
• Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.
• Curiosity, a continuous-learning mindset, and passion for applied AI and Observability.