No longer accepting applications
GPU Infrastructure & Data Center Engineer

PhoQtek labs, Hyderabad, Republic of India, IN
6 days ago
Job description

About the Role

We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure. This role focuses on next-generation GPU systems used for AI / ML workloads, covering every aspect from data center colocation and setup to GPU slicing, MIG management, resource allocation, optimization, and compliance. You will lead the end-to-end lifecycle of GPU infrastructure — ensuring all servers are optimized, secure, and production-ready for both internal and customer use.

Key Responsibilities

  • 1. Colocation & Infrastructure Setup

You will have full ownership of GPU colocation and end-to-end infrastructure setup.

  • Coordinate with data centers for rack installation, power, and cooling.
  • Deploy and configure GPU-based servers for production readiness.
  • 2. GPU & AI / ML Infrastructure

  • Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.
  • Install and maintain the NVIDIA software stack — CUDA, cuDNN, NCCL, and DCGM.
  • Optimize GPU infrastructure for AI / ML workloads (TensorFlow, PyTorch, RAPIDS).
  • Support multi-GPU scaling using NVLink and PCIe passthrough.
  • 3. Systems & Virtualization

  • Administer Linux-based environments (Ubuntu, CentOS, Rocky) and other operating environments as required.
  • Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.
  • Handle container orchestration with Docker and Kubernetes, including the NVIDIA GPU Operator.
  • Integrate high-performance storage (NFS, Ceph, SAN / NAS) for large-scale datasets.
  • 4. Monitoring & Performance Optimization

  • Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.
  • Proactively detect, analyze, and resolve GPU or system bottlenecks.
  • Optimize GPU nodes for training and inference performance.
  • Implement structured logging, alerts, and usage reporting.
  • Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.
  • 5. Security & Compliance

  • Harden GPU servers for multi-tenant workloads.
  • Manage driver, firmware, and software license compliance.
  • Ensure infrastructure security and audit readiness with periodic patching and updates.
  • 6. Networking & High-Performance I/O

  • Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).
  • Optimize low-latency interconnects for distributed GPU workloads.
  • Troubleshoot and enhance data transfer performance.
  • 7. Customer & Infrastructure Ownership

  • Serve as the primary contact for GPU resource allocation.
  • Provision GPU slices or MIG instances for internal and external teams.
  • Troubleshoot, document, and optimize workload performance.
Qualifications

  • Proven experience in data center server setup and colocation.
  • Deep expertise in GPU server administration (NVIDIA A100 / H100 or equivalent).
  • Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.
  • Experience with Linux administration, virtualization (VMware / KVM / Proxmox), and containers (Docker / Kubernetes).
  • Hands-on experience with AI / ML frameworks such as TensorFlow and PyTorch.
  • Familiarity with monitoring tools (Prometheus, Grafana, DCGM).
  • Knowledge of storage systems (NFS, Ceph) and high-performance networking.
  • Strong vendor coordination and infrastructure management skills.
Why This Role Matters

    This position owns the entire lifecycle of GPU-based infrastructure — from colocation to slicing, monitoring, and optimization. You will build and maintain the backbone of our AI / ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.
