Role: ML Ops Lead Engineer (Machine Learning Operations Lead Engineer)
Job Mode: Remote
Experience: 7+ Years
Notice Period: Immediate / 10 to 15 Days
Role Overview
We are seeking a highly skilled ML Ops Lead Engineer with extensive experience in Machine Learning Operations, Cloud Infrastructure, and Generative AI platforms. The ideal candidate will have deep expertise in both Azure and AWS ecosystems, with a proven track record of designing, deploying, and maintaining scalable and secure ML solutions across hybrid environments.
Key Responsibilities
- Design, deploy, and manage scalable, secure, and reliable cloud-based ML infrastructures leveraging Azure and AWS services.
- Lead ML Ops initiatives to streamline model development, deployment, and monitoring pipelines.
- Collaborate with data scientists, ML engineers, and platform teams to operationalize ML models efficiently.
- Implement and maintain CI/CD pipelines for ML workflows using Azure DevOps or AWS CodePipeline.
- Drive governance, observability, compliance, and audit controls within ML and GenAI environments.
- Refine and enforce security best practices, including IAM, RBAC, Azure Policy, and AWS SCPs.
- Oversee AI evaluation, prompt security scans, and red teaming using Azure AI Evaluation SDK.
- Manage data storage, compute, and networking integrations across S3, DynamoDB, Cosmos DB, RDS, and Blob Storage.
- Build Infrastructure as Code (IaC) using Terraform, ARM/Bicep, CloudFormation, or equivalent tools.
- Implement monitoring and observability solutions using Grafana, Prometheus, Application Insights, and Azure Monitor.
- Support ML model lifecycle management: deployment, monitoring, retraining, and drift detection.
- Collaborate with stakeholders to resolve ML pipeline issues and support model infrastructure needs.
Required Skills & Experience
- 7+ years of experience in platform engineering, ML Ops, or DevOps with cloud infrastructure expertise.
- Proficiency with Azure (Azure ML, Databricks, AKS, AI Services, Azure Search) and AWS (SageMaker, Bedrock, Lambda).
- Experience in Generative AI and Agentic AI ecosystems, including Azure OpenAI, AI Foundry, AI Hub, Bedrock, Anthropic Claude, OpenAI API, LlamaCloud, and LangChain.
- Strong understanding of token usage, prompt injection risks, jailbreak attempts, and mitigation techniques.
- Expertise in Azure DevOps / AWS CodePipeline for ML CI/CD automation.
- Proficient in Azure Blob Storage, Cosmos DB, Key Vault, AWS S3, RDS, DynamoDB, and integrations with AI services.
- Advanced understanding of networking (DNS, load balancing, VPNs, VNets) and security concepts (IAM, policies, encryption).
- Proficiency in Infrastructure as Code (IaC): Azure ARM/Bicep, Terraform, or CloudFormation.
- Knowledge of Python (with AI/ML libraries like TensorFlow, PyTorch, Scikit-learn) and scripting in Bash/PowerShell.
- Experience with containerization and orchestration using Docker and Kubernetes.
- Familiarity with Azure Bot Framework, API Management, Application Gateway, and M365 Copilot.
- Working knowledge of monitoring and logging tools such as Grafana, Prometheus, and Azure Log Analytics.
ML Engineering & Model Lifecycle Expertise
- Hands-on experience with Azure Machine Learning Studio, Python SDK (v2), and CLI (v2) for ML model management.
- Understanding of ML/DL algorithms, model training, evaluation, and deployment workflows.
- Practical exposure to CI/CD orchestration for data science pipelines and post-deployment model monitoring.
- Experience enabling production-grade ML models, including drift monitoring, model retraining, and business validation.
Security & Governance
- Familiarity with Microsoft Active Directory (AD) and the principle of least privilege for RBAC enforcement.
- Experience applying unit testing, integration testing, and CI/CD best practices within Azure DevOps (ADO).
Cloud-Specific Expertise
AWS
- Proficiency in AWS services: RDS, DynamoDB, Redshift, Aurora, EC2, EBS, EFS, Lambda, SQS, SNS, EventBridge, Step Functions, KMS, ECR.
- Strong experience with AWS CloudFormation, CDK, and the Python (Boto3) SDK.
Azure
- Expertise in Azure databases (Cosmos DB, Azure SQL Serverless), compute services (VMs, Scale Sets), and serverless components (Functions, Event Grid/Hub, Service Bus, Queue Storage).
- Experience managing Azure AKS/ACR, Azure Machine Learning, Azure Data Lake, Azure Key Vault, and ARM/Bicep templates.