Job Description:
Senior Infrastructure Automation Engineer (Zero-Touch GPU Cloud Build & Upgrade)
We are looking for a Senior Infrastructure Automation Engineer with 10+ years of hands-on experience building and scaling infrastructure automation systems to lead the design and implementation of a Zero-Touch Build, Upgrade, and Certification framework for our on-prem GPU cloud environment. This role demands deep technical expertise across bare-metal provisioning, configuration management, and full-stack automation, from hardware to Kubernetes, built entirely on GitOps principles.
Key Responsibilities
- Architect, lead, and implement a fully automated, zero-touch deployment pipeline for GPU cloud infrastructure spanning hardware → OS → Kubernetes → platform layers.
- Build robust GitOps-based workflows to manage the end-to-end infrastructure lifecycle, from provisioning to continuous compliance.
- Design and maintain automation for:
  - Bare-metal control: power cycling, provisioning, remote installs
  - Firmware and configuration flashing: BIOS, NIC, RAID, etc.
  - Hardware inventory management
  - Configuration drift detection and remediation
- Develop and extend internal automation frameworks using Ansible, Python, and related infrastructure tooling.
- Serve as a technical authority and mentor, guiding junior engineers and collaborating cross-functionally with hardware, SRE, and platform engineering teams.
- Lead architectural and design reviews for infrastructure automation systems.
- Define and implement best practices for infrastructure as code, compliance, and operational resilience.
- Champion automation-driven operational models and reduce manual intervention to near-zero.
- Bonus: Familiarity with Terraform, Chef, and cloud automation platforms.
Required Skills & Experience
- 10+ years of hands-on experience in infrastructure engineering, automation, and systems design, with a strong track record of delivering scalable and maintainable solutions.
- Primary key skills required: Ansible, Python, ipmitool, firmware scripting, and Linux shell scripting.
- Deep expertise in:
  - Ansible for automation and configuration management
  - Python for scripting, integration, and automation logic
  - ipmitool and related tools for low-level hardware management (e.g., IPMI, Redfish)
- Proven experience with bare-metal automation in data center environments, including:
  - Power control and PXE booting
  - BIOS / NIC / RAID firmware upgrades
  - Hardware and platform inventory systems
- Strong foundation in Linux systems, networking, and Kubernetes infrastructure.
- Fluency with GitOps workflows and tools.
- Experience with CI/CD systems and managing Git-based pipelines for infrastructure.
- Familiarity with infrastructure monitoring, logging, and drift detection.
- Strong cross-team collaboration and communication skills, especially across hardware, platform, and SRE teams.
- Bonus:
  - Prior leadership or mentorship roles
  - Experience contributing to or maintaining open-source infrastructure projects
  - Exposure to GPU-based compute stacks and high-performance workloads