Total Experience - 7+ Years
Relevant Experience - 5+ Years
Must-Have - At least 1 year of experience with GPU-based environments
Notice Period - Up to 30 Days
JD:
We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI/ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues in the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI/ML pipelines and GPU compute environments.
Key Responsibilities
- Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.
- Documentation: Maintain clear documentation for infrastructure setups and processes.
- System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
- GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.
- Troubleshooting: Diagnose and resolve hardware and software issues on GPU compute nodes, as well as performance issues in GPU clusters.
- High-Speed Interconnects: Implement and manage high-speed networking technologies such as RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
- Automation: Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.
- CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
- Containerization & Orchestration: Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
- Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
- Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards such as SOC 2 or ISO 27001.
- Cluster Support: Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.
- Scalability: Optimize infrastructure for high-throughput AI workloads, including GPU capacity and auto-scaling configurations.
- Collaboration: Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.
Required Qualifications
Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.