About the Role:
We are seeking an experienced DevOps Engineer to join our infrastructure team, with a strong focus on managing and optimizing GPU-based compute environments for machine learning and deep learning workloads.
In this role, you will be responsible for the end-to-end infrastructure lifecycle, from provisioning with Terraform/Ansible to deploying ML models using modern frameworks like Hugging Face and Ollama.
Key Responsibilities:
- Manage infrastructure using Terraform and Ansible
- Deploy and monitor Kubernetes clusters with GPU support (including NVIDIA drivers and H100 SXM integration)
- Implement and manage inferencing frameworks such as Ollama, Hugging Face, etc.
- Support containerization (Docker), logging (EFK), and monitoring (Prometheus/Grafana)
- Handle GPU resource scheduling, isolation, and scaling for ML/DL workloads
- Collaborate closely with developers, data scientists, and ML engineers to streamline deployments and performance
Required Skill Set:
- 5-8 years of hands-on experience in DevOps and infrastructure automation
- Proven experience in managing GPU-based compute environments
- Strong understanding of Docker, Kubernetes, and Linux internals
- Familiarity with GPU server hardware and instance types
- Proficient in scripting with Python and Bash
- Good understanding of ML model deployment, inferencing workflows, and resource utilization/metering
Nice to Have:
- Experience with AI/ML pipelines
- Knowledge of cloud-native technologies (AWS/GCP/Azure) supporting GPU workloads
- Exposure to model performance benchmarking and A/B testing
(ref: hirist.tech)