Job Description:
Senior Kubernetes Platform Engineer (Zero-Touch GPU Cloud – GitOps Automation)
We are looking for a Senior Kubernetes Platform Engineer with 10+ years of infrastructure experience to design and implement the Zero-Touch Build, Upgrade, and Certification pipeline for our on-premises GPU cloud platform. This role focuses on automating the Kubernetes layer and its dependencies (e.g., GPU drivers, networking, runtime) using 100% GitOps workflows. You will work across teams to deliver a fully declarative, scalable, and reproducible infrastructure stack, from hardware to Kubernetes and platform services.
Key Responsibilities
- Architect and implement GitOps-driven Kubernetes cluster lifecycle automation using tools like kubeadm, ClusterAPI, Helm, and Argo CD.
- Develop and manage declarative infrastructure components for:
  - GPU stack deployment (e.g., NVIDIA GPU Operator)
  - Container runtime configuration (Containerd)
  - Networking layers (CNI plugins like Calico, Cilium, etc.)
- Lead automation efforts to enable zero-touch upgrades and certification pipelines for Kubernetes clusters and associated workloads.
- Maintain Git-backed sources of truth for all platform configurations and integrations.
- Standardize deployment practices across multi-cluster GPU environments, ensuring scalability, repeatability, and compliance.
- Drive observability, testing, and validation as part of the continuous delivery process (e.g., cluster conformance, GPU health checks).
- Collaborate with infrastructure, security, and SRE teams to ensure seamless handoffs between lower layers (hardware / OS) and the Kubernetes platform.
- Mentor junior engineers and contribute to the platform automation roadmap.
Required Skills & Experience
- 10+ years of hands-on experience in infrastructure engineering, with a strong focus on Kubernetes-based environments.
- Primary key skills required are Kubernetes API, Helm templating, Argo CD GitOps integration, Go / Python scripting, and Containerd.
- Deep knowledge and hands-on experience with:
  - Kubernetes cluster management (kubeadm, ClusterAPI)
  - Argo CD for GitOps-based delivery
  - Helm for application and cluster add-on packaging
  - Containerd as a container runtime and its integration in GPU workloads
- Experience deploying and operating the NVIDIA GPU Operator or equivalent in production environments.
- Solid understanding of CNI plugin ecosystems, network policies, and multi-tenant networking in Kubernetes.
- Strong GitOps mindset with experience managing infrastructure as code through Git-based workflows.
- Experience building Kubernetes clusters in on-prem environments (vs. managed cloud services).
- Proven ability to scale and manage multi-cluster, GPU-accelerated workloads with high availability and security.
- Solid scripting and automation skills (Bash, Python, or Go).
- Familiarity with Linux internals, systemd, and OS-level tuning for container workloads.
Bonus:
- Experience with custom controllers, operators, or Kubernetes API extensions
- Contributions to Kubernetes or CNCF projects
- Exposure to service meshes, ingress controllers, or workload identity providers