Talent.com
Site Reliability Engineer

NR Consulting • Bangalore, Bangalore (division), India
1 day ago
Job description

Total Experience - 7+ Years

Relevant Experience - 5+ Years

Must have at least 1 year of GPU experience

Notice Period - up to 30 Days

JD :

We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI / ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI / ML pipelines and GPU-based compute environments.

Key Responsibilities

  • Infrastructure Management : Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.
  • Documentation : Maintain clear documentation for infrastructure setups and processes.
  • System Management : Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
  • GPU Infrastructure : Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI / ML and HPC applications.
  • Troubleshooting : Diagnose and resolve hardware and software issues related to GPU compute nodes and performance issues in GPU clusters.
  • High-Speed Interconnects : Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
  • Automation : Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.
  • CI / CD Pipelines : Build and optimize continuous integration and deployment (CI / CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
  • Containerization & Orchestration : Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
  • Monitoring & Performance : Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
  • Security and Compliance : Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.
  • Cluster Support : Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.
  • Scalability : Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.
  • Collaboration : Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.

Required Qualifications

  • Experience : 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.