Senior Site Reliability Engineer

Confidential, Delhi, India
4 days ago
Job description

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology—and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is looking for a passionate engineer to join our DGX Cloud Engineering Team as a Sr. Site Reliability Engineer. In this role, you will play a significant part in helping to craft and guide the future of AI & GPUs in the Cloud. NVIDIA DGX Cloud is a cloud platform tailored for AI tasks, enabling organizations to transition AI projects from development to deployment in the age of intelligent AI. Are you passionate about cloud software development, and do you strive for quality? Do you pride yourself on building cloud-scale software systems? If so, join our team at NVIDIA, where we are dedicated to delivering GPU-powered services around the world!

What You'll Be Doing

You will play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurement and automation tools that improve the efficiency of operations, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS / IaaS.
  • Work closely with other teams on new products or features / improvements of existing products.
  • Develop, maintain and improve cloud deployment of our software.
  • Participate in the triage & resolution of complex infra-related issues.
  • Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability to ensure service operability, reliability, and availability.
  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
  • Develop, maintain, and improve automation tools that help improve the efficiency of SRE operations.
  • Practice balanced incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.

What We Need To See

  • BS or MS in Computer Science or equivalent program from an accredited University / College.
  • 8+ years of hands-on software engineering or equivalent experience.
  • Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.
  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.
  • Understanding of building agentic AI solutions, preferably with NVIDIA open source AI solutions.
  • Demonstrate working experience in SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.
  • Hands-on experience with Docker, containers, Infrastructure as Code (for example, Terraform), and deployment CI / CD.
  • Exhibit knowledge of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route53, etc.), Azure, etc.

Ways To Stand Out From The Crowd

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI DBs like Milvus.
  • A track record of solving complex problems with elegant solutions.
  • Prior experience with Go, Python, and React.
  • Demonstrate delivery of complex projects in previous roles.
  • Showcase ability in developing frontend applications with concepts of SSA and RBAC.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

JR2000387

Skills Required

K8s, Route53, RESTful Web Services, ECR, EC2, Docker, Terraform, IAM, Azure, Kubernetes, AWS
