Company Description
Concept Information Technologies (I) Pvt. Ltd., headquartered in Pune, is a leading IT solutions and system integration partner. We deliver scalable and cost-effective solutions in High-Performance Computing, Disaster Recovery, Enterprise Networking, Cybersecurity and Software Development. Backed by partnerships with HPE, Cisco, IBM and others, we combine industry expertise with advanced technology to drive measurable business value.
📍 Location: Pune (with occasional project-based travel to Mumbai, Bengaluru, or Hyderabad)
Role Description
We’re seeking an experienced HPC Administrator to design, deploy and manage large-scale high-performance computing environments. This role requires deep technical expertise, hands-on cluster management experience and the ability to optimize performance for demanding workloads.
Key Responsibilities:
- Design, deploy, and maintain HPC clusters and supporting infrastructure.
- Manage schedulers such as SLURM, PBS Pro, or LSF.
- Manage job queues, partitions, and scheduling policies to ensure efficient workload distribution.
- Support user issues related to job submissions, resource requests, and job scripts.
- Work proficiently with Docker, Kubernetes, and other container technologies.
- Maintain containerized environments using Docker and Enroot.
- Optimize cluster performance and resource utilization.
- Develop and manage workflows for submitting container-based jobs to compute clusters using SLURM or similar job schedulers.
- Support and troubleshoot third-party HPC software.
- Compile and deploy open-source software.
- Integrate GPU-based workloads with SLURM, PBS, or similar job scheduling systems.
- Understanding of MPI implementations such as Intel MPI.
- Understanding of user authentication methods such as IPA/IdM, NIS, and LDAP.
- Expert knowledge of parallel distributed file systems such as Lustre, IBM GPFS, and BeeGFS.
- Implement backup, disaster recovery, and monitoring solutions.
- Ability to deploy open-source and commercial HPC platforms.
- Support application teams with MPI libraries, parallel processing, and GPU setups.
- Automate repetitive tasks through scripting.
- Create and maintain detailed technical documentation.
- Mentor junior team members and collaborate on solution design.
Required Skills:
- 7-8 years of experience in HPC administration or Linux systems engineering.
- Strong expertise in cluster management, tuning, and performance optimization.
- Experience with storage systems, networking (InfiniBand, Ethernet), and monitoring tools (Grafana, Prometheus).
- Proficiency in shell scripting, Python, or automation tools.
- Knowledge of MPI, CUDA, or GPU computing is an advantage.
- Excellent troubleshooting and communication skills.