Role : ML Ops Lead Engineer
Job Mode : Remote
Experience : 7+ Years
Notice Period : Immediate / 10 to 15 Days
Experience Required :
7+ years in platform or infrastructure engineering with significant experience in ML Ops, AI, and Cloud (Azure & AWS).
Key Responsibilities :
- Design, deploy, and manage scalable, secure, and high-performing cloud-based infrastructures across Azure and AWS.
- Lead the end-to-end ML Ops lifecycle, including model deployment, monitoring, retraining, and CI/CD integration.
- Collaborate with AI/ML, Data Science, and DevOps teams to automate model lifecycle management and streamline ML workflows.
- Architect and implement governance, compliance, observability, and security frameworks for ML and GenAI systems.
- Drive innovation in Generative AI and Agentic AI ecosystems, integrating services like Azure OpenAI, Bedrock, Anthropic Claude, and the OpenAI API.
- Implement infrastructure-as-code (IaC) practices using Terraform, Bicep, ARM, or CloudFormation.
- Manage networking, IAM, and security configurations across Azure and AWS environments.
- Establish monitoring, alerting, and performance dashboards using Grafana, Prometheus, Azure Monitor, and Log Analytics.
Required Technical Skills :
Cloud Platforms :
- Azure: Azure AI Services, Azure Search, Azure ML, Databricks, AKS, Azure AI Foundry, Azure AI Hub.
- AWS: SageMaker, Bedrock, Lambda, ECS, CDK, CloudFormation.
AI/ML & Generative AI :
- Exposure to Generative and Agentic AI ecosystems (Azure OpenAI, Bedrock, Claude, LlamaCloud, LangChain).
- Understanding of token usage, prompt injection, jailbreak risks, and mitigation methods.
- Experience with Azure AI Evaluation SDK and AI Red Teaming Prompt Security Scans.
- Hands-on experience with Python ML libraries (TensorFlow, PyTorch, Scikit-learn).
DevOps & Automation :
- Strong experience with Azure DevOps / AWS CodePipeline for CI/CD setup and management.
- Familiarity with Docker, Kubernetes, and container orchestration.
- Knowledge of IaC tools (Terraform, ARM / Bicep, CloudFormation).
Database & Storage :
- Azure Blob Storage, Cosmos DB, SQL, Key Vault, Data Lake Storage.
- AWS S3, DynamoDB, RDS, Redshift, Aurora.
- Understanding of OLTP and OLAP systems.
Networking & Security :
- Proficiency in DNS, VPNs, Load Balancing, VNets, IAM, and access control (RBAC, SCP, Azure Policy).
- Familiarity with Microsoft AD and principles of least privilege.
- Hands-on with KMS, Key Vault, and identity governance best practices.
ML Engineering & Workflow Management :
- Experience using Azure Machine Learning Studio, SDK (v2), CLI (v2) for model monitoring, retraining, and deployment.
- Build and optimize end-to-end ML workflows for production environments.
- Implement drift monitoring, model retraining, and technical & business validation processes.
- Collaborate with data scientists for model deployment and performance optimization.
Additional Skills (Good to Have) :
- Experience with code assistant tools (GitHub Copilot, Cursor, Claude Code).
- Familiarity with Azure Bot Framework, APIM, Application Gateway.
- Exposure to M365 Copilot and related ecosystem tools.
- Proficiency with AWS Python SDK (Boto3) and AWS CDK.
Testing & Quality :
- Implement unit and integration testing in CI/CD workflows (preferably using ADO).
- Ensure testing and validation coverage for ML pipelines and infrastructure deployments.
Preferred Qualifications :
- Bachelor's or Master's in Computer Science, Information Technology, or related field.
- Certification(s) in Azure AI Engineer, AWS Machine Learning Specialty, or DevOps highly desirable.