High-Performance Computing (HPC) infrastructures provide users with dedicated compute resources to run computation-intensive workloads such as weather simulations, artificial intelligence (AI), and machine learning (ML). Each job submitted by a user may consist of multiple tasks that run concurrently on different nodes, often requiring shared access to intermediate or final data. To facilitate this, HPC systems typically use a Parallel File System (PFS) that allows data to be accessed across nodes. However, this PFS is commonly shared among all users, so multiple jobs access the storage system simultaneously. This shared usage can lead to I/O interference, where one user's job slows down due to competing I/O demands from other users, thereby increasing overall job execution time.

To address this challenge, we are developing software that allows HPC infrastructure providers to provision isolated PFS instances per user or per job, reducing interference by isolating I/O traffic. We are also designing the software to support dynamic performance scaling of PFS instances, integrate erasure-coded fault tolerance, and enable data tiering to object storage systems. If you are interested in contributing to this effort or would like to discuss it further, please reach out.
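To make the provisioning idea more concrete, the sketch below shows how a job-scoped, isolated PFS instance might be described and requested. It is a minimal illustration under assumed names only: PFSInstanceSpec, provision_pfs_instance, the field names, and the "8+2" erasure layout are hypothetical examples, not the API of any existing product.

```python
from dataclasses import dataclass

# Illustrative sketch only: all names and fields below are assumptions,
# not part of an existing product or library.

@dataclass
class PFSInstanceSpec:
    job_id: str                 # HPC job this PFS instance is bound to
    storage_targets: int        # number of dedicated storage servers
    capacity_gib: int           # usable capacity for this instance
    erasure_scheme: str         # e.g. "8+2" (data shards + parity shards)
    tier_to_object_store: bool  # whether cold data may spill to object storage


def provision_pfs_instance(spec: PFSInstanceSpec) -> dict:
    """Assemble a provisioning plan for an isolated, job-scoped PFS instance.

    A real provisioning layer would allocate storage nodes and launch PFS
    services; this function only builds the plan to illustrate the idea.
    """
    data_shards, parity_shards = (int(x) for x in spec.erasure_scheme.split("+"))
    return {
        "job_id": spec.job_id,
        "targets": spec.storage_targets,
        "capacity_gib": spec.capacity_gib,
        "erasure": {"data_shards": data_shards, "parity_shards": parity_shards},
        "tiering": {"object_store": spec.tier_to_object_store},
    }


if __name__ == "__main__":
    plan = provision_pfs_instance(
        PFSInstanceSpec(
            job_id="job-1234",
            storage_targets=4,
            capacity_gib=2048,
            erasure_scheme="8+2",
            tier_to_object_store=True,
        )
    )
    print(plan)
```

Keeping the instance description declarative, as in this sketch, is one way the provisioning layer could scale storage targets up or down at runtime and decide when to tier cold data to object storage.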
Key Responsibilities
Required Skills and Qualifications
Why Join Us?
Developer • Pune, India