Talent.com
Senior Site Reliability Engineer - GPU Cloud

Confidential - Bengaluru / Bangalore, India
9 days ago
Job description

NVIDIA has been a pioneer in Accelerated Computing and has been paving the way with innovations in Generative AI, Large Language Models (LLMs), Autonomous Vehicles, Robotics, High-Performance Computing (HPC), Gaming / Visualization, and Edge / Data Center / Cloud Computing. NVIDIA provides automakers, research institutions, cloud providers, large companies, and start-ups with the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

We are a fast-paced, dynamic, and dedicated Site Reliability Engineering (SRE) team working at the forefront of the latest science and technology trends in cloud and on-prem infrastructure management for High-Performance and Distributed Computing. Working closely with the development teams, we provide hosted solutions for our internal and external customers. Are you passionate about infrastructure, and do you enjoy working on and resolving intricate, multi-faceted issues? Are you eager to have your hands on the engines of the next generation of cloud services? Do you get a buzz from identifying and eliminating toil, and from designing and coding innovative solutions that address the needs of a whole organization? If so, read on and give us a shout.

What you'll be doing:

The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI / ML stack customers. This SRE team is accountable for the setup, management, reliability, and availability of this infrastructure, which spans thousands of GPU nodes. As a senior SRE, you are responsible for:

Providing scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.

Owning the whole life cycle of new tools and services, from requirements gathering to design documentation, validation, and deployment.

Providing customer support on a rotation basis.

What we need to see:

Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments in on-prem / cloud environments.

Proficiency in at least one of the following languages: Go, Python, Perl, C++, Java, or C.

Strong command of Terraform, Kubernetes, and cloud infrastructure administration.

Excellent debugging and troubleshooting skills.

Ability to design simple and reliable systems that can work without much support.

Outstanding teammate who can collaborate and influence in a multifaceted environment.

Excellent interpersonal and written communication skills.

M.Sc. or B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd:

Ability to decompose complex requirements into simple tasks and to reuse available solutions to implement most of them.

Proven record of maintaining platform SLAs through accurate resolutions.

Unit testing and benchmarking are an integral part of your code.

Ability to reason and choose the best possible algorithm to meet scaling and availability challenges.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Skills Required

Terraform, C, Java, Kubernetes, Python, Perl, Go
