GPU Infrastructure & Data Center Engineer

PhoQtek labs
Hyderabad, Telangana, India
5 days ago
Job description

About the Role

We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure. This role focuses on next-generation GPU systems used for AI/ML workloads, covering every aspect from data center colocation and setup to GPU slicing, MIG management, resource allocation, optimization, and compliance. You will lead the end-to-end lifecycle of GPU infrastructure, ensuring all servers are optimized, secure, and production-ready for both internal and customer use.

Key Responsibilities

1. Colocation & Infrastructure Setup

You will have full ownership of GPU colocation and end-to-end infrastructure setup.

  • Coordinate with data centers for rack installation, power, and cooling.
  • Deploy and configure GPU-based servers for production readiness.
2. GPU & AI/ML Infrastructure

  • Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.
  • Install and maintain the NVIDIA software stack — CUDA, cuDNN, NCCL, and DCGM.
  • Optimize GPU infrastructure for AI/ML workloads (TensorFlow, PyTorch, RAPIDS).
  • Support multi-GPU scaling using NVLink and PCIe passthrough.
3. Systems & Virtualization

  • Administer Linux-based environments (Ubuntu, CentOS, Rocky Linux) and other operating environments.
  • Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.
  • Handle container orchestration with Docker and Kubernetes, including the NVIDIA GPU Operator.
  • Integrate high-performance storage (NFS, Ceph, SAN/NAS) for large-scale datasets.
4. Monitoring & Performance Optimization

  • Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.
  • Proactively detect, analyze, and resolve GPU or system bottlenecks.
  • Optimize GPU nodes for training and inference performance.
  • Implement structured logging, alerts, and usage reporting.
  • Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.
5. Security & Compliance

  • Harden GPU servers for multi-tenant workloads.
  • Manage driver, firmware, and software license compliance.
  • Ensure infrastructure security and audit readiness with periodic patching and updates.
6. Networking & High-Performance I/O

  • Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).
  • Optimize low-latency interconnects for distributed GPU workloads.
  • Troubleshoot and enhance data transfer performance.
7. Customer & Infrastructure Ownership

  • Serve as the primary contact for GPU resource allocation.
  • Provision GPU slices or MIG instances for internal and external teams.
  • Troubleshoot, document, and optimize workload performance.
Qualifications

  • Proven experience in data center server setup and colocation.
  • Deep expertise in GPU server administration (NVIDIA A100/H100 or equivalent).
  • Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.
  • Experience with Linux administration, virtualization (VMware/KVM/Proxmox), and containers (Docker/Kubernetes).
  • Hands-on experience with AI/ML frameworks such as TensorFlow and PyTorch.
  • Familiarity with monitoring tools (Prometheus, Grafana, DCGM).
  • Knowledge of storage systems (NFS, Ceph) and high-performance networking.
  • Strong vendor coordination and infrastructure management skills.
Why This Role Matters

This position owns the entire lifecycle of GPU-based infrastructure, from colocation to slicing, monitoring, and optimization. You will build and maintain the backbone of our AI/ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.
