HPC Team Lead

SHI | Locuz - An SHI Company, Hyderabad, Telangana, India
18 hours ago
Job description

Hi,

We have an immediate requirement for an HPC Team Lead position in Hyderabad with our organization, SHI Locuz Enterprise Solutions Pvt Ltd.

Please find the job description below.

Experience - 6+ years

Work location - Hyderabad

ROLE SUMMARY

The Technology Lead – HPC ensures that critical IT services and high-performance computing (HPC) infrastructure are available, efficient, and secure. The person in this role manages the daily operations of mission-critical systems in multiple clients' data centres, working closely with both facilities engineering teams (power, cooling, physical infrastructure) and IT infrastructure / operations teams to support clients around the clock. This role combines technical leadership, operations oversight, incident / problem management, and strategic planning.

PRIMARY ROLES & RESPONSIBILITIES

Experience architecting and maintaining HPC / AI systems.

Linux system administration

Cluster management

System and software configuration management

High speed networking

Resource managers and schedulers

High speed parallel storage

Monitoring and alerting

Strong understanding of HPC / AI architectures and concepts.

Experience supporting and managing a group of HPC / AI Clusters.

Excellent knowledge of prototyping and deploying HPC / AI clusters.

Extensive experience in troubleshooting Linux OS, filesystems and cluster hardware.

Good command of various Linux scripting tools, such as Bash, Perl, and Python.

Experience implementing, maintaining, and verifying defined security policies.

Willingness to maintain a flexible work schedule.

A positive attitude and willingness to help lab users succeed.

Excellent guidance and teamwork skills.

TECHNICAL SKILLS

RedHat, Ubuntu, SuSE OS

Cluster tools (Bright, xCAT, Warewulf, OpenHPC, Rocks, etc.)

InfiniBand

Lustre, BeeGFS and GPFS architecture and maintenance

Configuration management software (Ansible, Puppet)

Slurm / PBS / LSF / Grid Engine schedulers

Spack package manager

Experience in AI server and software stack deployment.

Experience with container technologies and orchestration tools: Docker, Singularity, Apptainer, Kubernetes.

Hands-on experience with AI / ML tools: TensorFlow, PyTorch, Keras, ONNX, JAX.

Experience in benchmarking and performance optimization of large-scale HPC / AI systems

Experience with Linux and / or Windows operating systems, including file management, scripting, editing, and security.

Log consolidation and monitoring (Ganglia, Grafana, etc.)

Lifecycle and patch management experience.

SOFT SKILLS

Good logical reasoning and analytical skills

Good communication skills

OTHER SKILLS

Collaborative, cooperative, and committed mindset.

Teamwork

Excellent analytical and problem-solving skills.

Ability to work independently and within cross-functional teams.

Detail-oriented with good documentation practices.

Excellent interpersonal, communication, customer interaction, documentation, and decision-making skills.
