Job Description:
Senior Infrastructure Test & Validation Engineer (Zero-Touch GPU Cloud – GitOps Validation & Certification)
We are seeking a Senior Infrastructure Test & Validation Engineer with 10+ years of experience to lead the Zero-Touch Validation, Upgrade, and Certification automation of our on-prem GPU cloud platform. This role focuses on ensuring the stability, performance, and conformance of the entire stack, from hardware to Kubernetes, using automated, GitOps-based validation pipelines. The ideal candidate has a strong infrastructure background with deep hands-on skills in Sonobuoy, LitmusChaos, k6, and pytest, and is passionate about automated test orchestration, platform resilience, and continuous conformance.
Key Responsibilities
Design and implement automated, GitOps-compliant pipelines for validation and certification of the GPU cloud stack across hardware, OS, Kubernetes, and platform layers.
Integrate Sonobuoy for Kubernetes conformance and certification testing.
Design and orchestrate chaos engineering workflows using LitmusChaos to validate system resilience across failure scenarios.
Implement performance testing suites using k6 and system-level benchmarks, integrated into CI/CD pipelines.
Develop and maintain end-to-end test frameworks using pytest and/or Go, focusing on cluster lifecycle events, upgrade paths, and GPU workloads.
Ensure test coverage and validation across multiple dimensions: conformance, performance, fault injection, and post-upgrade validation.
Build and maintain dashboards and reporting for automated test results, including traceability, drift detection, and compliance tracking.
Collaborate with infrastructure, SRE, and platform teams to embed testing and validation early in the deployment lifecycle.
Own quality assurance gates for all automation-driven deployments.
Required Skills & Experience
10+ years of hands-on experience in infrastructure engineering, systems validation, or SRE roles.
Primary skills required: pytest, Go, k6 scripting, integration of automation frameworks (Sonobuoy, LitmusChaos), and CI integration.
Strong experience with:
Sonobuoy for Kubernetes conformance and diagnostics
LitmusChaos for fault injection and resilience validation
k6 for performance/load testing in distributed environments
pytest or Go-based test frameworks for automation and validation scripting
Deep understanding of Kubernetes architecture, upgrade patterns, and operational risks.
Experience validating infrastructure components (GPU drivers, kernel modules, CNI, CRI, etc.) across lifecycle events.
Proficiency in GitOps workflows and in integrating tests into declarative, Git-backed pipelines (e.g., with Argo CD or Flux).
Hands-on experience with CI/CD systems (e.g., GitHub Actions, GitLab CI, Jenkins) to automate test orchestration.
Solid scripting and automation experience (Python, Bash, or Go).
Familiarity with GPU-based infrastructure and its performance characteristics is a strong plus.
Strong debugging, root cause analysis, and incident investigation skills.
Performance Automation • India