HPC System Administration & Troubleshooting :
- Manage and optimize HPC clusters, ensuring high availability and performance.
- Troubleshoot GPU, CPU, network drivers, firmware, and OS-level issues.
- Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.
Kubernetes & Cloud HPC Environments :
Deploy and manage HPC workloads in Kubernetes for AI / ML and parallel computing.Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.Implement containerized HPC workflows using Kubernetes and OpenShift.Automation & Infrastructure as Code (IaC) :
Develop Ansible and Terraform scripts for provisioning and managing HPC resources.Automate job scheduling, cluster monitoring, and log analysis using Python.Optimize CI / CD pipelines for HPC and AI / ML applications.Performance Tuning & Benchmarking :
Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.Client Support & Collaboration :
Provide real-time technical support and troubleshooting for HPC users.Engage with developers, DevOps, and system administrators to optimize cluster performance.Document solutions, best practices, and contribute to internal knowledge bases.PREFERRED QUALIFICATIONS :
Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.Exposure to containerized workloads using Singularity or Docker in HPC.Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.Skills Required
systems design , Ansible, Openstack, Hpc, Kubernetes