Position : Senior Linux Administrator – AI / ML Infrastructure
Location : Remote
Experience : 5+ Years
Type : Full-time
Role Overview
We are seeking a highly skilled Senior Linux Administrator to implement and manage on-premises Linux servers optimized for AI / ML workloads.
The ideal candidate will have deep expertise in Linux system administration and Kubernetes cluster management, along with a strong understanding of data center infrastructure components, including servers, networking, storage, and virtualization technologies.
This role requires hands-on experience automating infrastructure, optimizing performance, and ensuring reliability for high-performance computing (HPC) and AI / ML pipelines.
Key Responsibilities
Deploy, configure, and manage on-premises Linux servers supporting AI / ML workloads.
Set up, manage, and troubleshoot Kubernetes clusters for containerized workloads.
Optimize system and network performance for compute-intensive applications.
Automate provisioning and configuration using Ansible, Terraform, and scripting (Bash / Python).
Administer and monitor data center components such as servers, storage arrays, switches, and power systems.
Ensure system security, patch management, and compliance across environments.
Collaborate with DevOps, Data Science, and AI engineering teams to enable seamless integration with ML pipelines.
Plan and implement scalability strategies, maintaining uptime and redundancy.
Maintain comprehensive documentation of configurations, policies, and network diagrams.
Required Skills & Qualifications
5+ years of experience in Linux system administration (RHEL, Ubuntu, CentOS).
Proven hands-on experience with Kubernetes cluster management (setup, scaling, troubleshooting).
CKA (Certified Kubernetes Administrator) certification is mandatory.
Strong knowledge of data center components – servers, racks, networking switches, storage systems, and virtualization layers.
Experience with Ansible, Terraform, CI / CD pipelines, and infrastructure automation.
Proficiency in scripting languages (Bash, Python).
Understanding of performance tuning, system optimization, and fault diagnosis.
Excellent problem-solving, communication, and collaboration skills.
Preferred / Good to Have
Exposure to NVIDIA GPU management, CUDA environments, and AI / ML compute nodes.
Familiarity with HPC environments and distributed computing frameworks.
Experience managing monitoring systems (Prometheus, Grafana) and backup solutions.
Knowledge of DevOps practices, containerization, and hybrid cloud environments.
AI Infrastructure Engineer • India