Description : AI Ops Engineer - (ML Ops & LLM Ops).
Location : Hyderabad or Remote (within India).
Experience : 11-14 Years.
IMMEDIATE JOINERS FROM IT SERVICES ORGANIZATIONS PREFERRED.
Role Overview :
We are looking for experienced AI Ops Engineers with deep expertise in MLOps, LLM deployment, and AI infrastructure management.
The ideal candidate will design and operate robust pipelines that support the full lifecycle of Large Language Models (LLMs), from training and fine-tuning through production deployment, monitoring, and optimization.
Key Responsibilities :
- Design, implement, and maintain CI/CD pipelines for LLM training, fine-tuning, evaluation, and deployment.
- Integrate tools like SonarQube and Checkmarx to enforce code quality and security standards.
- Establish comprehensive versioning for models, datasets, prompts, and configurations to ensure traceability.
- Package and deploy LLMs using Docker and Kubernetes for scalable, consistent runtime environments.
- Set up end-to-end monitoring for model performance metrics: latency, throughput, cost, and output quality (hallucination, coherence, safety).
- Implement alerting mechanisms to detect anomalies, performance degradation, and model drift.
- Manage and tune cloud infrastructure (AWS, Azure) and GPU/TPU environments for optimal performance.
- Use Terraform or CloudFormation for automated environment provisioning and configuration management.
- Apply cost optimization strategies for LLM inference and serving while maintaining reliability.
- Architect systems for high availability, fault tolerance, and resilience in AI workloads.
- Diagnose and resolve infrastructure or model-related issues in production environments.
- Contribute to frameworks ensuring model explainability, fairness, and traceability.
- Automate data ingestion, retraining triggers, and pipeline orchestration using modern MLOps tools.
- Build and manage complex LLM workflows through orchestration platforms for efficient end-to-end operations.
- Continuously monitor and address model degradation, data drift, and other operational risks.
Required Skills & Experience :
- 10+ years in DevOps, ML Ops, or AI Infrastructure roles.
- Strong hands-on experience with LLM deployment, MLOps frameworks, and cloud platforms (AWS, Azure).
- Proficiency in Docker, Kubernetes, Terraform, and CI/CD tools (Jenkins, GitLab CI/CD, etc.).
- Deep understanding of LLM lifecycle management, performance tuning, and observability.
- Knowledge of security and compliance for AI systems.
- Experience with GPU/TPU optimization and cost-efficient scaling.
- Proven problem-solving and incident management abilities.
- Strong communication and cross-functional collaboration skills.
- Exposure to generative AI, prompt engineering, or RLHF pipelines.
- Familiarity with LLM-specific monitoring tools and safety frameworks.
- Open-source contributions in MLOps or AI Ops are a plus.
- Certifications in Cloud (AWS/Azure) or DevOps practices preferred.
(ref : hirist.tech)