Talent.com
Site Reliability Engineer

NR Consulting, Bangalore, Karnataka, India
1 day ago
Job description

Total Experience - 7+ Years

Relevant Experience - 5+ Years

Must have at least 1 year of experience with GPUs

Notice Period - up to 30 Days

JD :

We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI / ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI / ML pipelines and GPU-based compute environments.

Key Responsibilities

  • Infrastructure Management : Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.
  • Documentation : Maintain clear documentation for infrastructure setups and processes.
  • System Management : Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
  • GPU Infrastructure : Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI / ML and HPC applications.
  • Troubleshooting : Diagnose and resolve hardware and software issues related to GPU compute nodes and performance issues in GPU clusters.
  • High-Speed Interconnects : Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
  • Automation : Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.
  • CI / CD Pipelines : Build and optimize continuous integration and deployment (CI / CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
  • Containerization & Orchestration : Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
  • Monitoring & Performance : Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
  • Security and Compliance : Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.
  • Cluster Support : Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.
  • Scalability : Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.
  • Collaboration : Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.

Required Qualifications

  • Experience : 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.