L3 Storage (VAST) Specialist
Location : Chennai
Employment Type : Full-Time
Experience : Relevant expertise in VAST Data storage platforms
About Us
We are a well-funded stealth AI startup building next-generation AI infrastructure and high-performance data systems. To support our fast-scaling environment, we are looking for experienced L3 Storage (VAST) Specialists to join our core infrastructure team.
Role Overview
As an L3 Storage (VAST) Specialist, you will manage, operate, troubleshoot, and optimize our VAST clusters (CBox / DBox). You will serve as the subject-matter expert (SME) responsible for the reliability, continuity, and performance of mission-critical storage systems powering large-scale AI workloads.
This role requires hands-on experience with VAST Data’s architecture, along with strong skills in incident handling, root cause analysis (RCA) delivery, upgrade execution, and coordination with VAST Support.
Key Responsibilities
- Manage and maintain VAST clusters (CBox / DBox) across multi-node deployments.
- Handle L3-level storage incidents end-to-end, including diagnosis, resolution, and preventive actions.
- Lead storage upgrade planning, scheduling, execution, and rollback readiness.
- Work closely with VAST Support for escalations, bug reviews, and advanced troubleshooting.
- Conduct and publish Root Cause Analyses (RCAs) for critical incidents.
- Perform proactive performance analysis and tuning to support demanding AI / ML workloads.
- Collaborate with internal SRE, Platform, and Infrastructure teams to ensure system stability.
- Maintain detailed documentation of configurations, runbooks, and upgrade paths.
- Contribute to building a dedicated VAST SME pool that ensures long-term continuity in upgrades, RCAs, and performance investigations.
Required Skills & Experience
- Hands-on experience with the VAST Data storage platform, including CBox / DBox.
- Strong understanding of distributed storage systems, NVMe, RDMA, and NFS / SMB / iSCSI protocols.
- Proven expertise in troubleshooting large-scale storage environments.
- Experience coordinating with storage OEM / vendor support teams.
- Ability to work during business hours and extended hours, if required (no 24×7 shift model).
- Solid understanding of monitoring, capacity planning, and performance engineering.
- Strong analytical and documentation skills.
Nice to Have
- Experience supporting HPC, GPU clusters, or AI workloads.
- Background in SRE, distributed systems, or data center operations.
- Automation / scripting skills (Python, Bash, Ansible, etc.).