HPC L3

Yotta Data Services Private Limited
Panvel, Maharashtra, India
6 days ago
Job description

Job Scope :

As an HPC Admin, you will be responsible for managing and maintaining GPU supercomputing clusters built on NVIDIA reference architecture. You will ensure optimal performance and uptime of these critical systems in support of high-performance computing (HPC) requirements.

Job Responsibilities :

  • Configure and maintain GPU supercomputing clusters and their associated networking configuration.
  • Implement and optimize software stacks including MaaS (metal-as-a-service), Job Scheduler (SLURM / PBS), Cloud Orchestration (Kubernetes), and Network Management (NetQ for Ethernet fabric and UFM for InfiniBand).
  • Conduct performance activities such as debugging, profiling, benchmarking, and tuning of GPU applications on large-scale supercomputing clusters.
  • Run benchmarking applications from widely used platforms such as MLPerf Training & Inference, AI Training (PyTorch, TensorFlow, NeMo, Megatron-LM), and AI Inference (TensorRT-LLM, Triton Inference Server, vLLM).

Must-Have Skill :

  • Hands-on experience with NVIDIA GPUs, particularly NVIDIA Data Centre GPUs (A100 / H100).
  • Experience in provisioning and managing software stacks like MaaS, Job Scheduler (SLURM / PBS), Cloud Orchestration (Kubernetes), and Network Management (NetQ for Ethernet fabric and UFM for InfiniBand).
  • Prior experience collaborating with NVIDIA Solution Architect & Engineering teams on large-scale GPU-as-a-service projects.
  • Familiarity with benchmarking applications from widely used platforms and frameworks, including MLPerf, PyTorch, TensorFlow, NeMo, Megatron-LM, TensorRT-LLM, Triton Inference Server, and vLLM.
  • Experience in performance engineering, including debugging, profiling, benchmarking, and tuning various GPU applications on large-scale supercomputing clusters.
Good to Have Skill :

  • Knowledge of other HPC technologies and architectures beyond NVIDIA, broadening expertise in the field.
  • Experience with other cloud platforms and orchestration tools, expanding versatility in deployment environments.
  • Strong problem-solving and troubleshooting abilities, enabling quick resolution of complex technical issues.
  • Excellent communication and collaboration skills to work effectively within cross-functional teams and with external partners.
Behavioral Attributes :

  • Strong problem-solving skills with a proactive and solution-oriented approach.
  • Excellent communication and collaboration skills for effective customer support.
  • Adaptability to handle a dynamic and fast-paced cloud administration environment.
  • Commitment to security best practices and continuous improvement.
Qualification and Experience :

  • Bachelor’s degree in engineering, or equivalent.
  • Minimum 5+ years of experience in IT, 5+ years of relevant experience in HPC engineering roles, with a focus on NVIDIA GPU and networking technologies.
  • Demonstrated success in deploying and managing large-scale GPU Supercomputing clusters, preferably in collaboration with NVIDIA teams.
  • Proven track record of performance engineering activities and optimizing GPU applications for high-performance computing workloads.