Talent.com
No longer accepting applications
High Salary: GPU Infrastructure & Data Center Engineer

PhoQtek labs, Hyderabad, Telangana, India
1 day ago
Job description

About the Role

We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure. This role focuses on next-generation GPU systems used for AI/ML workloads and covers every aspect of them: data center colocation and setup, GPU slicing, MIG management, resource allocation, optimization, and compliance. You will lead the end-to-end lifecycle of GPU infrastructure, ensuring all servers are optimized, secure, and production-ready for both internal and customer use.

Key Responsibilities

1. Colocation & Infrastructure Setup

You will own GPU colocation and end-to-end infrastructure setup.

  • Coordinate with data centers for rack installation, power, and cooling.
  • Deploy and configure GPU-based servers for production readiness.

2. GPU & AI/ML Infrastructure

  • Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.
  • Install and maintain the NVIDIA software stack: CUDA, cuDNN, NCCL, and DCGM.
  • Optimize GPU infrastructure for AI/ML workloads (TensorFlow, PyTorch, RAPIDS).
  • Support multi-GPU scaling using NVLink and PCIe passthrough.

3. Systems & Virtualization

  • Administer Linux-based environments (Ubuntu, CentOS, Rocky) and other operating environments as needed.
  • Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.
  • Handle container orchestration with Docker and Kubernetes GPU Operators.
  • Integrate high-performance storage (NFS, Ceph, SAN/NAS) for large-scale datasets.

4. Monitoring & Performance Optimization

  • Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.
  • Proactively detect, analyze, and resolve GPU or system bottlenecks.
  • Optimize GPU nodes for training and inference performance.
  • Implement structured logging, alerts, and usage reporting.
  • Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.

5. Security & Compliance

  • Harden GPU servers for multi-tenant workloads.
  • Manage driver, firmware, and software license compliance.
  • Ensure infrastructure security and audit readiness with periodic patching and updates.

6. Networking & High-Performance I/O

  • Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).
  • Optimize low-latency interconnects for distributed GPU workloads.
  • Troubleshoot and enhance data transfer performance.

7. Customer & Infrastructure Ownership

  • Serve as the primary contact for GPU resource allocation.
  • Provision GPU slices or MIG instances for internal and external teams.
  • Troubleshoot, document, and optimize workload performance.

Qualifications

  • Proven experience in data center server setup and colocation.
  • Deep expertise in GPU server administration (NVIDIA A100/H100 or equivalent).
  • Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.
  • Experience with Linux administration, virtualization (VMware/KVM/Proxmox), and containers (Docker/Kubernetes).
  • Hands-on experience with AI/ML frameworks such as TensorFlow and PyTorch.
  • Familiarity with monitoring tools (Prometheus, Grafana, DCGM).
  • Knowledge of storage systems (NFS, Ceph) and high-performance networking.
  • Strong vendor coordination and infrastructure management skills.

Why This Role Matters

This position owns the entire lifecycle of GPU-based infrastructure, from colocation to slicing, monitoring, and optimization. You will build and maintain the backbone of our AI/ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.