Senior Site Reliability Engineer - AI Research Clusters

Confidential, Gurgaon / Gurugram

30+ days ago

Job description

NVIDIA has continuously reinvented itself. Our invention of the GPU sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. Today, research in artificial intelligence is booming worldwide, which calls for the highly scalable and massively parallel computation horsepower at which NVIDIA GPUs excel. NVIDIA is a 'learning machine' that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world. This is our life's work: to amplify human creativity and intelligence. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join our diverse team and see how you can make a lasting impact on the world!

As a member of the GPU AI / HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What You'll Be Doing

In this role, you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also maintain and build deep learning AI-HPC GPU clusters at scale and support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support the operational and reliability aspects of large-scale distributed systems, with a focus on performance at scale, real-time monitoring, logging, and alerting.
Design and implement state-of-the-art GPU compute clusters.
Optimize cluster operations for maximum reliability, efficiency, and performance.
Drive foundational improvements and automation to enhance researcher productivity.
Troubleshoot, diagnose, and root-cause system failures, isolating the components and failure scenarios while working with internal and external partners.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems, and be part of an on-call rotation to support production systems.
Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world.
Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedural record and data log, and manage upgrades and automated rollbacks across all clusters.

What We Need To See

Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with at least 5 years of experience designing and operating large-scale compute infrastructure.

Proven experience in site reliability engineering for high-performance computing environments, with operational experience on clusters of at least 2K GPUs.

Deep understanding of GPU computing and AI infrastructure.

Passion for solving complex technical challenges and optimizing system performance.

Experience with AI / HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm.

Working knowledge of cluster configuration management tools such as BCM or Ansible, and of infrastructure-level applications such as Kubernetes, Terraform, and MySQL.

In-depth understanding of container technologies such as Docker and Enroot.

Experience programming in Python and Bash scripting.

Ways To Stand Out From The Crowd

Interest in crafting, analyzing, and fixing large-scale distributed systems.

Familiarity with NVIDIA GPUs, CUDA programming, NCCL, MLPerf benchmarking, and InfiniBand with IPoIB and RDMA.

Experience with Cloud Deployment, BCM, Terraform.

Understanding of fast, distributed storage systems like Lustre and GPFS for AI / HPC workloads.

Multi-cloud experience.

Skills Required

Ansible, Python, Kubernetes, GPU computing
