Job Purpose
Role Summary:
Responsible for architecting, designing, and implementing GPU-enabled, High Performance Computing (HPC), and AI/ML platform solutions. This role involves building scalable, secure, and optimized platforms to support enterprise AI, ML/DL workloads, simulation, and large-scale data analytics. The architect will define the infrastructure strategy, workload placement, performance optimization, and managed services roadmap for GPU and HPC platforms within the Data Centre (DC) business.
Role Description
Key Responsibilities:
Architect GPU and HPC infrastructure platforms for AI/ML training, inference, and HPC workloads.
Design GPUaaS (GPU-as-a-Service) models, including on-demand, reserved, and burst GPU clusters.
Integrate AI/ML frameworks (TensorFlow, PyTorch, Kubeflow, JupyterHub, etc.) into enterprise-ready stacks.
Ensure performance tuning, resource scheduling, and workload orchestration across HPC clusters and GPU nodes.
Optimize for distributed training, model parallelism, and storage bandwidth (NVMe, Lustre, GPFS, Ceph); see the illustrative distributed-training sketch after this list.
Provide cloud-native environments with containerized ML workflows (Kubernetes, Docker, Singularity).
Build and manage model hosting & inference platforms (REST APIs, containerized inference servers).
Implement data security, encryption, access control, and compliance frameworks for sensitive AI/HPC workloads.
Architect air-gapped solutions for government/defense workloads when required.
Evaluate and integrate next-gen GPUs (NVIDIA H200/A100/L40S, AMD MI300, etc.), HPC accelerators, and AI chipsets.
Enable hybrid/hyperconverged AI infrastructure (GPU + CPU + storage).
Collaborate with data scientists, researchers, and enterprise customers to align platform capabilities with business outcomes.
Define the GPU/HPC platform services catalog and managed service offerings.
Implement MLOps pipelines, infrastructure as code (Terraform, Ansible), and workload scheduling (SLURM, Kubernetes).
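To illustrate the distributed-training and workload-scheduling work referenced in this list, here is a minimal, hypothetical PyTorch DistributedDataParallel (DDP) sketch. It assumes the script is launched with torchrun (for example, torchrun --nproc_per_node=8 train.py, possibly inside a SLURM allocation); the model, dataset, and hyperparameters are placeholders, not details drawn from this role description.

```python
# Illustrative only: a minimal PyTorch DistributedDataParallel (DDP) training
# sketch, assuming launch via torchrun under a SLURM or Kubernetes allocation.
# The model, dataset, and hyperparameters below are placeholders.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data standing in for a real workload.
    model = DDP(nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)          # shards data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # gradients all-reduced over NCCL
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under SLURM, a script like this would typically be wrapped in an sbatch job that requests the GPU nodes and invokes torchrun (or srun) on each node; on Kubernetes, the equivalent is usually a multi-pod training job with one worker process per GPU.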
Experience & Educational Requirements
Qualifications and Experience
EDUCATIONAL QUALIFICATIONS: (degree, training, or certification required)
BE/B.Tech or equivalent in Computer Science or Electronics & Communication
RELEVANT EXPERIENCE: (no. of years of technical, functional, and/or leadership experience or specific exposure required)
Experience: 8–12 years of overall IT experience, with 5+ years in HPC/AI/ML/GPU platform architecture.
Technical Expertise:
Strong background in GPU architecture (NVIDIA, AMD) and HPC systems.
Proficiency in AI/ML frameworks (TensorFlow, PyTorch, Keras, MXNet, Hugging Face).
Experience with distributed training and orchestration frameworks (Kubeflow, MLflow, Ray, Horovod); a brief MLflow tracking example follows this list.
Knowledge of parallel computing, MPI, CUDA, ROCm, and GPU drivers.
Familiarity with storage technologies for HPC/AI (NVMe, Lustre, GPFS, Ceph, Object Storage).
Cloud & Hybrid AI Platforms: Hands-on experience with GPU cloud offerings (AWS SageMaker, Azure ML, GCP Vertex AI) and on-prem HPC cluster management.
Automation & MLOps: Experience with CI/CD for ML (MLOps), workflow automation, and infrastructure as code.
Security & Governance: Knowledge of data privacy, the DPDP Act, compliance standards (ISO, PCI-DSS, HIPAA), and secure GPU cluster design.
Certifications (Preferred): NVIDIA Certified AI Specialist, Azure AI Engineer, AWS ML Specialty, or HPC-related certifications.
Soft Skills: Strong stakeholder communication, the ability to collaborate with data scientists, researchers, and enterprise IT teams, and the ability to align technical solutions with business objectives.
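As a brief illustration of the experiment-tracking side of MLOps mentioned above (MLflow is named in the frameworks list), a hypothetical logging snippet might look like the following; the tracking URI, experiment name, parameters, and metric values are made-up examples rather than requirements of the role.

```python
# Illustrative only: logging a training run to an MLflow tracking server.
# The URI, experiment name, parameters, and metric values are hypothetical.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example:5000")  # assumed internal server
mlflow.set_experiment("gpu-platform-demo")

with mlflow.start_run(run_name="ddp-baseline"):
    mlflow.log_param("gpus", 8)
    mlflow.log_param("framework", "pytorch-ddp")
    mlflow.log_metric("val_accuracy", 0.91)
    mlflow.log_metric("epoch_time_s", 132.5)
    # Checkpoints and configs could also be attached with mlflow.log_artifact(...)
```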
Solution Architect • Chennai, Tamil Nadu, India