Job Purpose
Role Summary:
Responsible for architecting, designing, and implementing GPU-enabled, High Performance Computing (HPC), and AI/ML platform solutions. This role involves building scalable, secure, and optimized platforms to support enterprise AI, ML/DL workloads, simulation, and large-scale data analytics. The architect will define the infrastructure strategy, workload placement, performance optimization, and managed services roadmap for GPU and HPC platforms within the Data Centre (DC) business.
Role Description
Key Responsibilities:
Platform Architecture & Design:
Architect GPU and HPC infrastructure platforms for AI/ML training, inference, and HPC workloads.
Design GPU-as-a-Service (GPUaaS) models, including on-demand, reserved, and burst GPU clusters.
Integrate AI/ML frameworks (TensorFlow, PyTorch, Kubeflow, JupyterHub, etc.) into enterprise-ready stacks.
Infrastructure & Workload Optimization:
Ensure performance tuning, resource scheduling, and workload orchestration across HPC clusters and GPU nodes.
Optimize for distributed training, model parallelism, and storage bandwidth (NVMe, Lustre, GPFS, Ceph).
AI/ML Platform Enablement:
Provide cloud-native environments with containerized ML workflows (Kubernetes, Docker, Singularity).
Build and manage model hosting and inference platforms (REST APIs, containerized inference servers).
Security & Compliance:
Implement data security, encryption, access control, and compliance frameworks for sensitive AI/HPC workloads.
Architect air-gapped solutions for government/defense workloads when required.
Technology Integration & Innovation:
Evaluate and integrate next-generation GPUs (NVIDIA H200/A100/L40S, AMD MI300, etc.), HPC accelerators, and AI chipsets.
Enable hybrid/hyperconverged AI infrastructure (GPU + CPU + storage).
Customer & Business Enablement:
Collaborate with data scientists, researchers, and enterprise customers to align platform capabilities with business outcomes.
Define a GPU/HPC platform services catalog and managed service offerings.
Automation & DevOps:
Implement MLOps pipelines, infrastructure as code (Terraform, Ansible), and workload scheduling (SLURM, Kubernetes).
Experience & Educational Requirements
Qualifications and Experience
EDUCATIONAL QUALIFICATIONS:
(degree, training, or certification required)
BE/B.Tech or equivalent in Computer Science or Electronics & Communication
RELEVANT EXPERIENCE:
(no. of years of technical, functional, and/or leadership experience or specific exposure required)
Experience: 8–12 years of overall IT experience, with 5+ years in HPC/AI/ML/GPU platform architecture.
Technical Expertise:
Strong background in GPU architecture (NVIDIA, AMD) and HPC systems.
Proficiency in AI/ML frameworks (TensorFlow, PyTorch, Keras, MXNet, Hugging Face).
Experience with distributed training and orchestration frameworks (Kubeflow, MLflow, Ray, Horovod).
Knowledge of parallel computing, MPI, CUDA, ROCm, and GPU drivers.
Familiarity with storage technologies for HPC/AI (NVMe, Lustre, GPFS, Ceph, object storage).
Cloud & Hybrid AI Platforms: Hands-on experience with GPU cloud offerings (AWS SageMaker, Azure ML, GCP Vertex AI) and on-prem HPC cluster management.
Automation & MLOps: Experience with CI/CD for ML (MLOps), workflow automation, and infrastructure as code.
Security & Governance: Knowledge of data privacy, the DPDP Act, compliance frameworks (ISO, PCI-DSS, HIPAA), and secure GPU cluster design.
Certifications (Preferred): NVIDIA Certified AI Specialist, Azure AI Engineer, AWS ML Specialty, or HPC-related certifications.
Soft Skills: Strong stakeholder communication, the ability to collaborate with data scientists, researchers, and enterprise IT teams, and the capability to align technical solutions to business objectives.
Solution Architect • Delhi, India