We’re looking for a Machine Learning Observability Platform Engineer who’s passionate about building large-scale, reliable ML systems. You’ll help design and enhance our open-source observability platform, adding AI capabilities that power critical insights across enterprise environments.
What You’ll Do
- Build and maintain AI/ML features for an open-source Observability Platform built on Grafana and ClickHouse.
- Collaborate with SREs, service owners, and observability SMEs to ensure scalable, reliable ML model deployment.
- Design and manage data pipelines using Databricks and related tools.
- Use CI/CD and MLOps best practices to automate model deployment and testing.
- Deploy and manage ML infrastructure on AWS or Azure.
- Set up and integrate MCP servers and connect tools across observability systems.
- Establish prompt standards and develop custom MCP integrations between systems.
- Troubleshoot ML system performance and reliability using OpenTelemetry pipelines and observability metrics.
What We’re Looking For
- Master’s degree in Computer Science, Engineering, or Artificial Intelligence (or equivalent experience).
- Proven experience designing, developing, and operating ML systems in production.
- Hands-on experience with LLMs/MCPs, Grafana, and Databricks.
- Strong coding skills in Python.
- Familiarity with Kubernetes, container orchestration, and cloud platforms (AWS/Azure).
- Solid understanding of observability pillars (metrics, logs, traces).
- Experience implementing OpenTelemetry pipelines for ML systems.
- Knowledge of CI/CD, MLOps, and monitoring best practices.