HPC System Administrator
Job Summary
We are seeking an experienced High-Performance Computing (HPC) System Administrator to manage, maintain, and optimize large-scale HPC clusters and infrastructure. This role focuses on ensuring reliable system operations, implementing robust monitoring solutions, managing user environments, and maintaining high availability of compute resources for research and production workloads.
Key Responsibilities
- Install, configure, and maintain HPC cluster hardware and software components
- Manage job scheduling systems (SLURM, PBS, LSF) and optimize queue configurations
- Monitor system performance, resource utilization, and cluster health using tools such as Nagios, Zabbix, or Ganglia
- Administer user accounts, permissions, and resource allocations across compute nodes
- Deploy and maintain software stacks, compilers, libraries, and scientific applications
- Implement and maintain backup strategies and disaster recovery procedures
- Troubleshoot hardware failures, network issues, and software conflicts
- Apply regular system updates and security patches, and plan and run maintenance windows
- Manage storage systems including parallel file systems (Lustre, GPFS, BeeGFS)
- Coordinate with vendors for hardware support and warranty services
- Create and maintain system documentation and operational procedures
Required Qualifications
- Bachelor's degree in Computer Science, Information Technology, or related field
- 6+ years of experience administering Linux-based HPC systems
- Strong knowledge of Linux system administration (RHEL, CentOS, Ubuntu)
- Experience with job scheduling systems (SLURM preferred)
- Proficiency in shell scripting (Bash) and system automation
- Knowledge of networking concepts including InfiniBand and Ethernet fabrics
- Experience with configuration management tools (Ansible, Puppet, Chef)
- Understanding of parallel file systems and storage technologies
- Familiarity with HPC interconnects and high-speed networking
- Experience with system monitoring tools (Nagios, Zabbix, Ganglia)
Preferred Skills
- Experience with container technologies (Singularity, Docker) in HPC environments
- Knowledge of virtualization technologies (KVM, VMware)
- Familiarity with cloud computing platforms and hybrid cloud deployments
- Experience with GPU computing and CUDA environments
- Understanding of MPI, OpenMP, and other parallel programming models
- Knowledge of security best practices for multi-user HPC environments
- Experience with database administration (MySQL, PostgreSQL)
- Familiarity with ticketing systems and user support workflows
- Certification in relevant technologies (Red Hat, CompTIA, vendor-specific)
Skills Required
Linux System Administration