Talent.com
MLOps & LLMOps Engineer

Catalyst IQ, Chennai
8 days ago
Job description

We are looking for a highly skilled MLOps & LLMOps Engineer with strong expertise in deploying, automating, and monitoring AI/ML models, including Large Language Models (LLMs), in production environments. The ideal candidate will have hands-on experience in CI/CD automation, container orchestration, data pipelines, LangChain, and cloud deployment across Azure/AWS. You will collaborate with data scientists, ML engineers, and customer architects to ensure seamless end-to-end delivery of scalable, high-performing AI systems.

Key Responsibilities:

1. Model Deployment & Automation:

  • Automate the full lifecycle of AI/ML model deployment, including packaging, orchestration, scaling, and rollout strategies.
  • Implement automated workflows for data versioning, model versioning, and experiment tracking using tools like MLflow or similar systems.
  • Deploy Large Language Models (LLMs) to production using frameworks such as LangChain, Flask, FastAPI, or custom inference frameworks.
  • Containerize and orchestrate model services using Docker & Kubernetes, enabling highly available and fault-tolerant inference pipelines.

2. CI/CD & Infrastructure Automation:

  • Build and maintain robust CI/CD pipelines using Git, Jenkins, GitHub Actions, or GitLab CI for continuous integration, testing, and deployment of ML solutions.
  • Implement infrastructure-as-code (IaC) for automated provisioning of cloud resources (Terraform or equivalent).
  • Automate deployment workflows for API endpoints, microservices, feature stores, and data processing pipelines.

3. Data Pipelines & Real-Time Processing:

  • Design, deploy, and manage data ingestion and processing pipelines using Airflow, Kafka, and RabbitMQ.
  • Ensure reliable, scalable, and secure data pipelines that support both training and inference workflows.
  • Optimize data freshness, batch scheduling, and streaming performance for high-throughput model operations.

4. LLM & Foundation Model Operations:

  • Integrate and operationalize foundation model APIs such as OpenAI, Anthropic, Gemini, Cohere, etc.
  • Deploy custom or fine-tuned LLMs (GPT, Llama, Mistral, etc.) using LangChain or custom inference frameworks.
  • Implement prompt management, evaluation, caching, vector store integrations, and retrieval-augmented generation (RAG) pipelines.

  • Ensure high performance, low latency, and reliability of LLM-based production systems.

5. Cloud Deployment & Infrastructure Management:

  • Deploy ML workloads in Azure or AWS using services like Kubernetes (AKS/EKS), Lambda, EC2, S3/ADLS, API Gateway, Azure Functions, etc.

  • Monitor and optimize infrastructure cost, performance, and scalability for ML and LLM systems.
  • Collaborate with customer architects to define, plan, and execute end-to-end deployments and solution architectures.

6. Monitoring, Observability & Performance Optimization:

  • Implement and maintain observability stacks for model performance monitoring, including:
  • Latency, throughput, drift detection
  • Model accuracy and quality metrics
  • Resource utilization, autoscaling behavior
  • Use tools like Prometheus, Grafana, ELK, Datadog, or cloud-native monitoring solutions.
  • Troubleshoot production issues and perform root cause analysis across models, pipelines, and infrastructure.

Skills & Qualifications:

  • Strong hands-on experience in MLOps, production ML workflows, and automation.
  • Expertise in CI/CD tools (Git, Jenkins, GitHub Actions, GitLab CI).
  • Strong experience with Docker and Kubernetes for model containerization and deployment.
  • Practical knowledge of MLflow, LangChain, and experiment tracking/versioning systems.
  • Experience with Airflow, Kafka, RabbitMQ for large-scale data workflow orchestration.
  • Experience working with foundation model APIs (OpenAI, Anthropic, etc.).
  • Hands-on deployment experience on Azure and/or AWS cloud platforms.
  • Familiarity with performance monitoring tools (Prometheus, Grafana, Datadog, CloudWatch, etc.).
  • Solid understanding of distributed systems, microservices, and cloud-native architectures.
  • Strong communication, analytical, and debugging skills.
  • Ability to work in fast-paced environments and manage complex deployments.

Preferred (Nice-to-Have):

  • Knowledge of vector databases (Pinecone, Weaviate, FAISS, Chroma).
  • Experience with RAG pipelines, semantic search, embeddings, or LLM orchestration frameworks.
  • Exposure to model optimization techniques such as quantization, distillation, or low-latency inference optimization.
  • Hands-on experience with Terraform, Helm, or ArgoCD.
  • Experience with GPU-based deployments and optimization in cloud platforms.
(ref: hirist.tech)
