We're seeking a hands-on DevOps/Cloud Engineer who can own our infrastructure end-to-end and support a fast-moving ML engineering team. You will design, build, and operate the cloud, data, and deployment systems that power our product — with a strong emphasis on reliability, automation, security, and speed.
What You’ll Do
- Design & manage cloud infrastructure (AWS/GCP/Azure) for core services, data pipelines, and ML workloads
- Build CI/CD pipelines for backend systems, ML model training/inference, and microservices
- Deploy & optimize ML clusters — GPU nodes, autoscaling, Docker/Kubernetes-based workloads
- Implement observability: metrics, logs, traces, alerting, SLOs
- Own Infrastructure-as-Code (Terraform, CDK, Helm) for all environments
- Improve developer velocity through automated builds, tests, deploy tools, and internal platform enhancements
- Ensure security best practices: IAM, secrets management, VPC/networking, least-privilege
- Participate in on-call rotation and support incident response for production systems
What You Bring
- Strong experience with AWS/GCP (EKS/GKE, EC2, VPC, IAM, S3/Cloud Storage)
- Deep expertise in Docker & Kubernetes, including GPU-aware orchestration
- Hands-on experience with CI/CD systems (GitHub Actions, GitLab CI, Argo, Jenkins)
- Knowledge of MLOps tooling: model serving, feature pipelines, artifact/versioning (MLflow, Weights & Biases, KServe, Ray)
- Solid Python or Go skills for automation tooling
- Experience with monitoring & logging stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry)
- Startup mindset: move fast, debug chaos, own outcomes end-to-end
Nice-to-Haves
- Experience with distributed training (Ray, Horovod, PyTorch Distributed)
- Security hardening, cost optimization, and infrastructure scaling expertise
- Experience setting up GPU clusters (NVIDIA drivers, CUDA, MIG, Triton, Slurm/Ray clusters)