As Lead MLOps Engineer, you will play a pivotal role in building scalable, reliable machine learning infrastructure for enterprise-grade applications. We are looking for someone with a core data engineering background who has worked on large-scale data platforms and has strong exposure to MLOps practices. This hybrid role blends big data engineering with end-to-end model lifecycle management, from development and deployment to monitoring and retraining. The ideal candidate brings hands-on experience with Databricks, PySpark, and the orchestration of production-grade ML pipelines, enabling efficient and resilient solutions in dynamic, data-driven environments.
Roles and Responsibilities
- Design and implement distributed data processing pipelines using PySpark.
- Collaborate with business architects and stakeholders to design scalable data and ML workflows.
- Optimize performance of Spark applications through tuning, resource management, and caching strategies.
- Debug long-running Spark jobs using the Spark UI; address OOM errors, data skew, shuffle issues, and job retries.
- Manage model deployment workflows using tools like MLflow for tracking, versioning, and registry (a minimal sketch follows this list).
- Build and maintain CI/CD pipelines for both data and ML workflows.
- Containerize applications using Docker and orchestrate them with tools like Kubernetes.
- Monitor production models, manage retraining workflows, and handle dependency management.
- Contribute to clean, collaborative Git workflows with practices such as branching, rebasing, and PR reviews.
- Work across teams to ensure models are production-ready, scalable, and aligned with business goals.
- Develop and orchestrate big data workflows on Databricks.
- Work on at least one cloud platform (preferably Azure) for scalable data and ML solutions.
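To give candidates a concrete flavor of this work, below is a minimal sketch of a PySpark batch job with MLflow run tracking, the kind of pipeline this role builds and operates. It is illustrative only; the paths, table names, and column names are hypothetical placeholders, not our actual platform.

```python
# Minimal sketch: a PySpark feature-build job with MLflow tracking.
# Paths, table names, and column names are hypothetical placeholders.
import mlflow
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-pipeline").getOrCreate()

with mlflow.start_run(run_name="daily-feature-build"):
    # Read raw events from a (hypothetical) Delta path and build aggregates.
    events = spark.read.format("delta").load("/mnt/raw/events")
    features = (
        events
        .groupBy("user_id")
        .agg(
            F.count("*").alias("event_count"),
            F.avg("session_seconds").alias("avg_session_seconds"),
        )
    )
    # Persist features and log run metadata for lineage and auditing.
    features.write.format("delta").mode("overwrite").save("/mnt/features/user_daily")
    mlflow.log_param("source_path", "/mnt/raw/events")
    mlflow.log_metric("feature_rows", features.count())
```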
Required Skills and Experience:
- Proficient in PySpark, with strong experience in Spark performance tuning and optimization (see the tuning sketch after this list).
- Strong expertise in Databricks for development, orchestration, and job monitoring.
- Working knowledge of MLflow or similar tools for model lifecycle management.
- Proficient in Python and SQL.
- Deep understanding of distributed data systems, job scheduling, and fault tolerance.
- Experience working with structured and unstructured data formats such as Parquet, Delta, and JSON.
- Familiarity with feature stores, model monitoring, drift detection, and automated retraining workflows.
- Strong command of Git and version control in multi-developer environments.
- Experience with CI/CD tools for data and ML pipelines.
- Knowledge of containerization (Docker) and orchestration (Kubernetes) is a plus.
- Experience with at least one major cloud platform (Azure preferred, or AWS/GCP).
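As a short, illustrative sketch of the tuning skills listed above: caching a reused DataFrame, repartitioning a skewed key before a heavy aggregation, and enabling Adaptive Query Execution. Column names, paths, and the partition count are hypothetical.

```python
# Sketch of common Spark tuning moves: caching a reused DataFrame,
# repartitioning to relieve shuffle skew, and enabling AQE.
# Column names, paths, and partition counts are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-example")
    # Adaptive Query Execution (Spark 3.x) coalesces shuffle partitions
    # and splits skewed join partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

orders = spark.read.parquet("/mnt/raw/orders")  # hypothetical path

# Cache a DataFrame that several downstream steps reuse,
# so it is not recomputed from source each time.
orders.cache()

# Redistribute a skewed key across more partitions before a heavy
# aggregation to avoid a handful of long-running straggler tasks.
balanced = orders.repartition(200, "customer_id")
daily_totals = balanced.groupBy("customer_id").sum("amount")
daily_totals.write.mode("overwrite").parquet("/mnt/curated/daily_totals")
```

In practice, AQE handles much of the shuffle-partition sizing automatically; explicit repartitioning remains useful when a known hot key dominates the data.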