Talent.com
No longer accepting applications
▷ Apply in 3 Minutes : GPU Infrastructure & Data Center Engineer

PhoQtek labs, Hyderabad, Telangana, India
1 day ago
Job description

About the Role

We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure. This role focuses on next-generation GPU systems used for AI / ML workloads, covering every aspect from data center colocation and setup to GPU slicing, MIG management, resource allocation, optimization, and compliance. You will lead the end-to-end lifecycle of GPU infrastructure — ensuring all servers are optimized, secure, and production-ready for both internal and customer use.

Key Responsibilities

1. Colocation & Infrastructure Setup

You will own GPU colocation and end-to-end infrastructure setup.

  • Coordinate with data centers for rack installation, power, and cooling.
  • Deploy and configure GPU-based servers for production readiness.

2. GPU & AI / ML Infrastructure

  • Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.
  • Install and maintain the NVIDIA software stack — CUDA, cuDNN, NCCL, and DCGM.
  • Optimize GPU infrastructure for AI / ML workloads (TensorFlow, PyTorch, RAPIDS).
  • Support multi-GPU scaling using NVLink and PCIe passthrough.
3. Systems & Virtualization

  • Administer Linux-based environments (Ubuntu, CentOS, Rocky) and other operating environments as required.
  • Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.
  • Handle container orchestration with Docker and Kubernetes GPU Operators.
  • Integrate high-performance storage (NFS, Ceph, SAN / NAS) for large-scale datasets.
4. Monitoring & Performance Optimization

  • Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.
  • Proactively detect, analyze, and resolve GPU or system bottlenecks.
  • Optimize GPU nodes for training and inference performance.
  • Implement structured logging, alerts, and usage reporting.
  • Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.
5. Security & Compliance

  • Harden GPU servers for multi-tenant workloads.
  • Manage driver, firmware, and software license compliance.
  • Ensure infrastructure security and audit readiness with periodic patching and updates.
6. Networking & High-Performance I/O

  • Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).
  • Optimize low-latency interconnects for distributed GPU workloads.
  • Troubleshoot and enhance data transfer performance.
7. Customer & Infrastructure Ownership

  • Serve as the primary contact for GPU resource allocation.
  • Provision GPU slices or MIG instances for internal and external teams.
  • Troubleshoot, document, and optimize workload performance.
Qualifications

  • Proven experience in data center server setup and colocation.
  • Deep expertise in GPU server administration (NVIDIA A100 / H100 or equivalent).
  • Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.
  • Experience with Linux administration, virtualization (VMware / KVM / Proxmox), and containers (Docker / Kubernetes).
  • Hands-on experience with AI / ML frameworks such as TensorFlow and PyTorch.
  • Familiarity with monitoring tools (Prometheus, Grafana, DCGM).
  • Knowledge of storage systems (NFS, Ceph) and high-performance networking.
  • Strong vendor coordination and infrastructure management skills.
Why This Role Matters

This position owns the entire lifecycle of GPU-based infrastructure, from colocation to slicing, monitoring, and optimization. You will build and maintain the backbone of our AI / ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.
