SMTS Systems Design Eng.

Confidential, Hyderabad / Secunderabad, Telangana
Job description

HPC System Administration & Troubleshooting :

  • Manage and optimize HPC clusters, ensuring high availability and performance.
  • Troubleshoot GPU, CPU, network drivers, firmware, and OS-level issues.
  • Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.

Kubernetes & Cloud HPC Environments :

  • Deploy and manage HPC workloads in Kubernetes for AI / ML and parallel computing.
  • Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.
  • Implement containerized HPC workflows using Kubernetes and OpenShift.

Automation & Infrastructure as Code (IaC) :

  • Develop Ansible and Terraform scripts for provisioning and managing HPC resources.
  • Automate job scheduling, cluster monitoring, and log analysis using Python.
  • Optimize CI / CD pipelines for HPC and AI / ML applications.

Performance Tuning & Benchmarking :

  • Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).
  • Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.
  • Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.

Client Support & Collaboration :

  • Provide real-time technical support and troubleshooting for HPC users.
  • Engage with developers, DevOps, and system administrators to optimize cluster performance.
  • Document solutions, best practices, and contribute to internal knowledge bases.

Preferred Qualifications :

  • Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.
  • Exposure to containerized workloads using Singularity or Docker in HPC.
  • Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).
  • Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.

Skills Required :

    Systems design, Ansible, OpenStack, HPC, Kubernetes
