Solution Architect AI
Department: Cloud Presales
Location: Mumbai / Chennai
Job Purpose
To architect, design, and implement GPU-enabled High Performance Computing (HPC) and AI / ML platform solutions that are scalable, secure, and optimized for enterprise AI, ML / DL workloads, simulation, and large-scale data analytics. The role defines infrastructure strategy, workload placement, performance tuning, and managed-service roadmaps for GPU and HPC platforms within the Data Centre business.
Key Responsibilities
Platform Architecture & Design
- Architect GPU and HPC infrastructure platforms for AI / ML training, inference, and HPC workloads.
- Design GPU-as-a-Service (GPUaaS) models including on-demand, reserved, and burst GPU clusters.
- Integrate AI / ML frameworks and tooling (TensorFlow, PyTorch, Kubeflow, JupyterHub, etc.) into enterprise-ready stacks.
Infrastructure & Workload Optimization
- Optimize performance tuning, resource scheduling, and workload orchestration across HPC clusters and GPU nodes.
- Enhance distributed training, model parallelism, and storage bandwidth utilization (NVMe, Lustre, GPFS, Ceph).

AI / ML Platform Enablement
- Provide cloud-native environments with containerized ML workflows (Kubernetes, Docker, Singularity).
- Build and manage model hosting and inference platforms (REST APIs, containerized inference servers).

Security & Compliance
- Implement data security, encryption, access control, and compliance frameworks for sensitive AI / HPC workloads.
- Architect air-gapped solutions for government / defense workloads when required.

Technology Integration & Innovation
- Evaluate and integrate next-generation GPUs (NVIDIA H200 / A100 / L40S, AMD MI300, etc.), HPC accelerators, and AI chipsets.
- Enable hybrid and hyperconverged AI infrastructure combining GPU, CPU, and storage resources.

Customer & Business Enablement
- Collaborate with data scientists, researchers, and enterprise customers to align platform capabilities with business outcomes.
- Define the GPU / HPC platform services catalog and managed service offerings.

Automation & DevOps
- Implement MLOps pipelines, infrastructure as code (Terraform, Ansible), and workload scheduling (SLURM, Kubernetes).

Qualifications & Experience
Educational Qualifications
- BE / B-Tech or equivalent in Computer Science, Electronics & Communication, or related fields.

Experience
- 8–12 years of overall IT experience, including 5+ years in HPC / AI / ML / GPU platform architecture.

Technical Expertise
- Strong knowledge of GPU architecture (NVIDIA, AMD) and HPC systems.
- Proficiency with AI / ML frameworks such as TensorFlow, PyTorch, Keras, MXNet, Hugging Face.
- Experience with distributed training and orchestration frameworks like Kubeflow, MLflow, Ray, Horovod.
- Knowledge of parallel computing, MPI, CUDA, ROCm, and GPU drivers.
- Familiarity with storage technologies such as NVMe, Lustre, GPFS, Ceph, and object storage for HPC / AI workloads.
- Hands-on experience with GPU cloud platforms (AWS SageMaker, Azure ML, GCP Vertex AI) and on-prem HPC cluster management.
- Automation and MLOps expertise: CI / CD pipelines for ML, infrastructure as code, and workflow automation.
- Understanding of security and governance, including data privacy laws (e.g., DPDP Act), ISO, PCI-DSS, and HIPAA compliance, and secure GPU cluster design.

Certifications (Preferred)
- NVIDIA Certified AI Specialist
- Azure AI Engineer
- AWS ML Specialty
- HPC-related certifications

Soft Skills
- Strong stakeholder communication and collaboration skills.
- Ability to work effectively with data scientists, researchers, and enterprise IT teams.
- Strategic mindset, with the ability to align technical solutions to business objectives.