Role Overview
We are looking for an experienced MLOps Lead with deep expertise in the Azure and AWS cloud ecosystems who can design, deploy, and manage scalable AI/ML infrastructure. The ideal candidate brings a strong background in cloud governance, GenAI tooling, automation, and CI/CD pipelines, with hands-on experience across modern MLOps frameworks.
Key Responsibilities
- Design, implement, and manage scalable cloud-based AI/ML infrastructure across Azure and AWS.
- Drive the end-to-end MLOps lifecycle: model deployment, monitoring, retraining, and governance.
- Enable GenAI and Agentic AI platforms leveraging Azure OpenAI, Bedrock, Anthropic Claude, LangChain, etc.
- Implement CI/CD pipelines using Azure DevOps or AWS CodePipeline.
- Ensure security, observability, and compliance across ML and GenAI ecosystems.
- Manage infrastructure automation via Terraform, Bicep, CloudFormation, or similar IaC tools.
- Collaborate with data science and engineering teams to optimize ML workflows, data pipelines, and API integrations.
- Implement monitoring and alerting using Grafana, Prometheus, Azure Monitor, and Application Insights.
- Oversee networking, identity management, and role-based access controls (IAM, RBAC) across clouds.
- Support model lifecycle management: drift monitoring, retraining, technical evaluation, and business validation.
Technical Skills & Expertise
Cloud & MLOps Platforms
- Azure: Azure ML, Azure AI Services, Azure OpenAI, Azure Kubernetes Service (AKS), Databricks, Azure Search, Azure Blob, Cosmos DB, Azure SQL, Azure Functions, Azure Event Hub, Azure Resource Manager (ARM), Bicep.
- AWS: SageMaker, Bedrock, Lambda, DynamoDB, S3, RDS, Redshift, ECR, CloudFormation, CDK, KMS, EventBridge, Step Functions.
AI/ML & Programming
- Hands-on experience in Python, with exposure to TensorFlow, PyTorch, and scikit-learn.
- Understanding of LLM tokenization, prompt injection risks, jailbreak prevention, and AI safety techniques.
- Familiarity with LangChain, LlamaCloud, AI Foundry, and related frameworks.
- Experience in model monitoring, retraining, and evaluation workflows.
DevOps & Infrastructure
- Expertise in CI/CD pipelines, containerization (Docker, Kubernetes), and infrastructure automation.
- Strong in governance, audit logging, and security policies (Azure Policy, AWS SCP, IAM).
- Deep understanding of networking, DNS, load balancers, VNets/VPCs, and VPNs.
- Skilled in IaC tools: Terraform, Bicep, ARM, CloudFormation.
Monitoring & Observability
- Experience with Grafana, Prometheus, Application Insights, Log Analytics Workspaces, Azure Monitor.
Security & Access Management
- Understanding of Microsoft AD, least privilege principles, IAM, RBAC.
Testing & Automation
- Familiarity with unit testing and integration testing in CI/CD workflows (preferably Azure DevOps).
Good to Have
- Experience with Azure Bot Framework, M365 Copilot, and APIM.
- Exposure to code assistants such as GitHub Copilot, Cursor, and Claude Code.
- Knowledge of the Boto3 SDK (AWS SDK for Python) and TypeScript for IaC.
Preferred Background
- Strong background in cloud infrastructure engineering and machine learning operations.
- Proven ability to lead cross-functional teams and implement AI governance at scale.
- Excellent problem-solving, communication, and documentation skills.