Job Title : Lead Solutions Architect – AI Infrastructure & Private Cloud
Location : Bengaluru (Electronic City)
Experience : 10–15 Years (Lead / Architect Level)
Position Type : Full-Time | Immediate Joiners Preferred
Criticality : High
Role Overview :
We are seeking a Lead Solutions Architect specializing in AI Infrastructure and Private Cloud to design and deliver scalable, high-performance compute environments for machine learning, deep learning, and AI workloads. The ideal candidate will have deep expertise in Kubernetes, container orchestration, GPU / TPU acceleration, and HPC (High Performance Computing) architectures, enabling AI-driven innovation across enterprise platforms.
Key Responsibilities :
- Architect, design, and implement AI / ML infrastructure solutions across private and hybrid cloud environments.
- Lead setup and optimization of Kubernetes Landing Zones, including cluster design, multi-tenancy, and security.
- Manage containerized workloads using orchestration tools (Kubernetes, Docker, Podman, OpenShift).
- Integrate AI accelerators (NVIDIA GPUs, TPUs) for ML / DL model training and inference.
- Enable deployment of deep learning models with a focus on hardware acceleration, scalability, and performance tuning.
- Build and maintain edge and cloud-native deployment pipelines for AI workloads.
- Collaborate with AI / ML and DevOps teams to ensure robust CI / CD workflows for model deployment.
- Drive HPC architecture design, including compute, storage, networking, and scheduling (SLURM, PBS, etc.).
- Optimize HPC and AI infrastructure for cost, performance, and resource utilization.
- Provide technical leadership in evaluating and integrating emerging technologies (AI frameworks, MLOps platforms, accelerator hardware).
- Define standards, documentation, and best practices for AI infrastructure operations.
Required Technical Skills :
- Containerization & Orchestration : Kubernetes, Docker, Helm, OpenShift, Rancher
- Cloud Platforms : AWS, Azure, GCP (Private & Hybrid Cloud expertise preferred)
- AI / ML Infrastructure : NVIDIA GPU integration, CUDA, TensorRT, TPUs, PyTorch / TensorFlow deployment
- High Performance Computing (HPC) : HPC architecture, schedulers (SLURM, PBS), parallel computing, storage & network optimization
- DevOps & CI / CD : GitHub Actions, Jenkins, ArgoCD, Terraform, Ansible
- Monitoring & Observability : Prometheus, Grafana, ELK Stack
- Scripting / Programming : Python, Bash, YAML, Go (preferred)
Desired Skills :
- Experience with RAG / LLM model deployment pipelines or AI workload orchestration
- Knowledge of edge computing and distributed inference systems
- Exposure to AI model lifecycle management (MLOps)
- Strong problem-solving, leadership, and cross-functional collaboration skills