Talent.com
GPU Infrastructure & Data Center Engineer

PhoQtek labs, Hyderabad, Telangana, India
4 days ago
Job description

About the Role

We are seeking a highly skilled IT Solutions & GPU Infrastructure Lead to take complete ownership of our GPU-based server infrastructure. This role focuses on next-generation GPU systems used for AI / ML workloads, covering every aspect from data center colocation and setup to GPU slicing, MIG management, resource allocation, optimization, and compliance. You will lead the end-to-end lifecycle of GPU infrastructure — ensuring all servers are optimized, secure, and production-ready for both internal and customer use.

Key Responsibilities

1. Colocation & Infrastructure Setup

Take full ownership of GPU colocation and end-to-end infrastructure setup.

Coordinate with data centers for rack installation, power, and cooling.

Deploy and configure GPU-based servers for production readiness.

2. GPU & AI / ML Infrastructure

Manage GPU slicing and MIG (Multi-Instance GPU) for multi-tenant workloads.

Install and maintain the NVIDIA software stack — CUDA, cuDNN, NCCL, and DCGM.

Optimize GPU infrastructure for AI / ML workloads (TensorFlow, PyTorch, RAPIDS).

Support multi-GPU scaling using NVLink and PCIe passthrough.

3. Systems & Virtualization

Administer Linux-based environments (Ubuntu, CentOS, Rocky Linux) and other environments as required.

Manage virtualization platforms such as VMware, KVM, or Proxmox with GPU passthrough.

Handle containerization and orchestration with Docker and Kubernetes, including GPU operators (for example, the NVIDIA GPU Operator).

Integrate high-performance storage (NFS, Ceph, SAN / NAS) for large-scale datasets.

4. Monitoring & Performance Optimization

Monitor GPU and system performance using Prometheus, Grafana, NVIDIA DCGM, and nvidia-smi.

Proactively detect, analyze, and resolve GPU or system bottlenecks.

Optimize GPU nodes for training and inference performance.

Implement structured logging, alerts, and usage reporting.

Administer, manage, monitor, and maintain GPU infrastructure for AI workloads.

5. Security & Compliance

Harden GPU servers for multi-tenant workloads.

Manage driver, firmware, and software license compliance.

Ensure infrastructure security and audit readiness with periodic patching and updates.

6. Networking & High-Performance I/O

Configure and maintain high-speed network fabrics (InfiniBand, RDMA, RoCE).

Optimize low-latency interconnects for distributed GPU workloads.

Troubleshoot and enhance data transfer performance.

7. Customer & Infrastructure Ownership

Serve as the primary contact for GPU resource allocation.

Provision GPU slices or MIG instances for internal and external teams.

Troubleshoot, document, and optimize workload performance.

Qualifications

Proven experience in data center server setup and colocation.

Deep expertise in GPU server administration (NVIDIA A100 / H100 or equivalent).

Strong working knowledge of GPU slicing, MIG, CUDA, NCCL, and NVIDIA drivers.

Experience with Linux administration, virtualization (VMware / KVM / Proxmox), and containers (Docker / Kubernetes).

Hands-on experience with AI / ML frameworks such as TensorFlow and PyTorch.

Familiarity with monitoring tools (Prometheus, Grafana, DCGM).

Knowledge of storage systems (NFS, Ceph) and high-performance networking.

Strong vendor coordination and infrastructure management skills.

Why This Role Matters

This position owns the entire lifecycle of GPU-based infrastructure — from colocation to slicing, monitoring, and optimization. You will build and maintain the backbone of our AI / ML infrastructure, ensuring that all systems are efficient, scalable, and production-grade.
