Job Description :
We are seeking a highly skilled and experienced Machine Learning Operations Engineer to join our team. The ideal candidate will have a strong background in ML engineering and experience with Python, GitLab CI, Terraform, and AWS services. They will be responsible for productionizing models, implementing automation-first delivery, designing scalable serving, and establishing end-to-end observability.
Key Responsibilities :
- Productionize models, ensuring resilient services with clear service level objectives (SLOs), runbooks, and fast, safe rollbacks.
- Implement automation-first delivery : reproducible builds, layered tests, and environment promotion via GitLab CI and Terraform-based infrastructure as code (IaC).
- Design scalable serving : batch and real-time inference on Amazon Elastic Kubernetes Service (EKS) / Amazon Elastic Container Service (ECS) / AWS Lambda and Databricks Model Serving, with probes, autoscaling, and canary / blue-green deployments.
- Establish end-to-end observability (data, model, system); detect drift / regressions; lead incidents and post-mortems that drive durable fixes.
- Collaborate across teams to translate requirements into designs, architecture decision records (ADRs), and change plans; balance security, privacy, cost, and performance tradeoffs.
- Continuously reduce toil through automation, optimize model / GPU / LLM cost, and evolve templates / playbooks for repeatable delivery.
Minimum Qualifications :
- Bachelor's degree in Computer Science, Engineering, Data Science, or a related field and 3+ years of relevant experience as outlined in the key responsibilities; or High School Diploma / General Education Degree and 6+ years of relevant experience as outlined in the key responsibilities in lieu of Bachelor's Degree.
- 3+ years operating ML systems in production (MLOps).
- Experience with Python for ML engineering (packaging, typing, testing, performance).
- Experience developing GitLab CI for ML / GenAI (multi-stage pipelines, artifacts, evaluation / security gates) and Terraform for ML / GenAI (reusable modules, drift detection); secure packaging & containerization.
- Experience deploying and operating compute for ML (EKS / ECS / Lambda), and secure data access patterns (Amazon Simple Storage Service (S3) / Virtual Private Cloud (VPC) / Identity and Access Management (IAM) / Key Management Service (KMS), private endpoints).
- Experience implementing MLflow tracking, model registry & governed promotion, packaging & deployment to multi-target runtimes.
- Experience operating real-time + batch / streaming inference workloads, ML observability, layered testing (unit / integration), workflow orchestration, and cost optimization.
- Experience designing and implementing IAM least-privilege, secrets / key management for CI / CD pipelines; privacy and compliance awareness.
Preferred Qualifications :
- Advanced GitLab CI (dynamic child pipelines, components, cross-project triggers, security scans, compliance gates).
- Advanced Terraform (policy-as-code, gated plan / apply, environment promotion).
- Advanced real-time serving (multi-tenant routing, dynamic model loading) and SLO-driven rollback / automation.
- Databricks governance (Unity Catalog, lineage) and feature platform approval / reuse workflows.