Senior Site Reliability Engineer

Nvidia Graphics Pvt Ltd, India
30+ days ago
Job description

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's motivated by outstanding technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. NVIDIA is at the forefront of generative AI models, from language to images. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work.

NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its cloud service team for supporting, triaging, and building generative AI-powered visual applications. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. We live SRE practices that are key to product quality, such as limiting time spent on reactive operational work, blameless postmortems, proactive identification of potential outages, and iterative improvements, which all make for exciting and multifaceted day-to-day work. The person in this position will be responsible for Service Response and workflow and will drive tools / service development to maintain and improve service SLOs. We partner with Service Owners to drive the reliability of the service.

What you will be doing:

  • Support and work on groundbreaking Generative AI inferencing workloads running in a globally-distributed heterogeneous environment spanning all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.
  • Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.
  • Monitor and support critical, high-performance, large-scale services running across multiple clouds.
  • Participate in the triage & resolution of sophisticated infra-related issues.
  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice balanced incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems and lead significant production improvement around tooling, automation, and process.
  • Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

  • 8 years of experience operating & owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.
  • 3 years executing incident management and participating in an on-call rotation.
  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • Solid understanding of containerization and microservices architecture, K8s. Excellent understanding of the Kubernetes ecosystem and best practices with K8s.
  • Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
  • Technical leadership beyond development that includes scoping, requirements capturing, leading and influencing multiple teams of engineers on broad development initiatives.
  • Lead significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).
  • Strong understanding of SLOs and SLIs, error budgeting, and KPIs, and of configuring them for highly sophisticated services.
  • Experience with the ELK and Prometheus stacks as a power user and administrator.
  • Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
  • Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Exposure to containerization and cloud-based deployments for AI models.
  • Excellent coding skills in Python, Go, or a similar language.
  • Prior experience driving production issues and helping with on-call support and understanding of Deep Learning / Machine Learning / AI.
  • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton, as well as experience with StackStorm and similar automation platforms, is a bonus.
  • Understanding of observability instrumentation techniques and best practices, including OpenTelemetry.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

Locations: India, Bengaluru
