Hi,
We have an immediate requirement for an HPC Team Lead in Hyderabad at our organization, SHI Locuz Enterprise Solutions Pvt Ltd.
Please find the job description below:
Experience - 6+ years
Work location - Hyderabad
ROLE SUMMARY
The Technology Lead – HPC ensures that critical IT services and high-performance computing (HPC) infrastructure are available, efficient, and secure. The person in this role manages the daily operations of mission‐critical systems in multiple clients’ data centres, working closely with both facilities engineering teams (power, cooling, physical infrastructure) and IT infrastructure / operations teams to support clients around the clock. This role combines technical leadership, operations oversight, incident / problem management, and strategic planning.
PRIMARY ROLES & RESPONSIBILITIES
Experience architecting and maintaining HPC / AI systems, including:
Linux system administration
Cluster management
System and software configuration management
High speed networking
Resource managers and schedulers
High speed parallel storage
Monitoring and alerting
Strong understanding of HPC / AI architectures and concepts.
Experience supporting and managing a group of HPC / AI Clusters.
Excellent knowledge of prototyping and deploying HPC / AI clusters.
Extensive experience in troubleshooting Linux OS, filesystems and cluster hardware.
Good command of various Linux scripting tools, such as Bash, Perl, and Python.
Experience implementing, maintaining, and verifying defined security policies.
Willingness to maintain a flexible work schedule.
A positive attitude and willingness to help lab users succeed.
Excellent guidance and teamwork skills.
TECHNICAL SKILLS
Red Hat, Ubuntu, and SUSE operating systems
Cluster tools (Bright, xCAT, Warewulf, OpenHPC, Rocks, etc.)
InfiniBand
Lustre, BeeGFS and GPFS architecture and maintenance
Configuration management software (Ansible, Puppet)
Slurm / PBS / LSF / Grid Engine schedulers
Spack package manager
Experience in AI server and software stack deployment.
Experience with container technologies and orchestration tools: Docker, Singularity, Apptainer, Kubernetes.
Hands-on experience with AI / ML tools: TensorFlow, PyTorch, Keras, ONNX, JAX.
Experience in benchmarking and performance optimization of large-scale HPC / AI systems
Experience with Linux and / or Windows operating systems (OS), including file management, scripting, editing, and security.
Log consolidation and monitoring (Ganglia, Grafana, etc.)
Lifecycle and patch management experience.
SOFT SKILLS
Good logical reasoning and analytical skills
Good communication skills
OTHER SKILLS
Collaborative, co-operative, and committed mindset.
Teamwork
Excellent analytical and problem-solving skills.
Ability to work independently and within cross-functional teams.
Detail-oriented with good documentation practices.
Excellent interpersonal, communication, customer interaction, and documentation skills, along with decision-making ability.
Team Lead • Hyderabad, Telangana, India