Job Purpose
Role Summary:
Responsible for architecting, designing, and implementing GPU-enabled, High-Performance Computing (HPC), and AI/ML platform solutions. This role involves building scalable, secure, and optimized platforms to support enterprise AI, ML/DL workloads, simulation, and large-scale data analytics. The architect will define the infrastructure strategy, workload placement, performance optimization, and managed-services roadmap for GPU and HPC platforms within the Data Centre (DC) business.
Role Description
Key Responsibilities:
- Platform Architecture & Design:
- Architect GPU and HPC infrastructure platforms for AI/ML training, inference, and HPC workloads.
- Design GPUaaS (GPU-as-a-Service) models, including on-demand, reserved, and burst GPU clusters.
- Integrate AI/ML frameworks (TensorFlow, PyTorch, Kubeflow, JupyterHub, etc.) into enterprise-ready stacks.
- Infrastructure & Workload Optimization:
- Ensure performance tuning, resource scheduling, and workload orchestration across HPC clusters and GPU nodes.
- Optimize for distributed training, model parallelism, and storage bandwidth (NVMe, Lustre, GPFS, Ceph).
- AI/ML Platform Enablement:
- Provide cloud-native environments with containerized ML workflows (Kubernetes, Docker, Singularity).
- Build and manage model-hosting & inference platforms (REST APIs, containerized inference servers).
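As an illustration of the model-hosting bullet above, here is a minimal sketch of the request-handling core of a containerized inference server, using only the Python standard library. The `predict` function, the `/predict` route, and port 8080 are all hypothetical stand-ins, not details from this role description; a production platform would typically use a dedicated serving framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Hypothetical stand-in for real model inference:
    # returns the sum of the input features as a "score".
    return {"score": sum(features)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only serve the assumed /predict route.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Blocking entrypoint, e.g. for a container CMD.
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

In practice this handler would sit behind a container entrypoint and a load balancer, with the stub `predict` replaced by a model loaded from TensorFlow, PyTorch, or a serving runtime such as Triton.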
- Security & Compliance:
- Implement data security, encryption, access control, and compliance frameworks for sensitive AI/HPC workloads.
- Architect air-gapped solutions for government/defense workloads when required.
- Technology Integration & Innovation:
- Evaluate and integrate next-generation GPUs (NVIDIA H200/A100/L40S, AMD MI300, etc.), HPC accelerators, and AI chipsets.
- Enable hybrid/hyperconverged AI infrastructure (GPU + CPU + storage).
- Customer & Business Enablement:
- Collaborate with data scientists, researchers, and enterprise customers to align platform capabilities with business outcomes.
- Define the GPU/HPC platform services catalog and managed-service offerings.
- Automation & DevOps:
- Implement MLOps pipelines, infrastructure as code (Terraform, Ansible), and workload scheduling (SLURM, Kubernetes).
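To illustrate the workload-scheduling bullet above, the following is a sketch of a Slurm batch script for a multi-node GPU training job. The partition name, resource counts, walltime, and training script are placeholder assumptions for illustration only, not values from this posting.

```shell
#!/bin/bash
# Illustrative Slurm batch script for a distributed GPU training job.
#SBATCH --job-name=ddp-train
#SBATCH --partition=gpu          # hypothetical GPU partition name
#SBATCH --nodes=2                # two GPU nodes
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00

# srun launches one task per GPU across both nodes;
# train.py is a placeholder for the actual training entrypoint.
srun python train.py --epochs 10
```

Submitted with `sbatch`, a script like this is how a scheduler-managed HPC cluster allocates GPUs to training runs; the same resource requests map naturally onto Kubernetes device-plugin requests in containerized environments.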
Experience & Educational Requirements
Qualifications and Experience
EDUCATIONAL QUALIFICATIONS: (degree, training, or certification required)
BE/B.Tech or equivalent in Computer Science or Electronics & Communication
RELEVANT EXPERIENCE: (number of years of technical, functional, and/or leadership experience or specific exposure required)
Experience: 8–12 years of overall IT experience, with 5+ years in HPC/AI/ML/GPU platform architecture.
Technical Expertise:
- Strong background in GPU architecture (NVIDIA, AMD) and HPC systems.
- Proficiency in AI/ML frameworks (TensorFlow, PyTorch, Keras, MXNet, Hugging Face).
- Experience with distributed training and orchestration frameworks (Kubeflow, MLflow, Ray, Horovod).
- Knowledge of parallel computing, MPI, CUDA, ROCm, and GPU drivers.
- Familiarity with storage technologies for HPC/AI (NVMe, Lustre, GPFS, Ceph, object storage).
Cloud & Hybrid AI Platforms: Hands-on experience with GPU cloud offerings (AWS SageMaker, Azure ML, GCP Vertex AI) and on-prem HPC cluster management.
Automation & MLOps: Experience with CI/CD for ML (MLOps), workflow automation, and infrastructure as code.
Security & Governance: Knowledge of data privacy, the DPDP Act, compliance standards (ISO, PCI-DSS, HIPAA), and secure GPU cluster design.
Certifications (Preferred): NVIDIA Certified AI Specialist, Azure AI Engineer, AWS ML Specialty, or HPC-related certifications.
Soft Skills: Strong stakeholder communication; the ability to collaborate with data scientists, researchers, and enterprise IT teams; and the capability to align technical solutions with business objectives.