Key Responsibilities
- Design and develop platform services for containerized model training, distributed computing, artifact management, and deployment automation.
- Build and maintain infrastructure that ensures scalability, reproducibility, and security of AI workflows across various environments.
- Implement CI/CD pipelines, observability frameworks, and developer tools to support multiple AI teams efficiently.
- Collaborate with cross-functional teams (data scientists, ML engineers, architects, and developers) to deliver essential platform features.
- Champion automation, enforce performance and security best practices, and drive operational reliability across platform services.
- Provide technical mentorship and contribute to long-term architecture and platform strategy.
- Stay updated with the latest advancements in cloud infrastructure, AI/ML tooling, and platform engineering trends.
Skills & Qualifications
- 7+ years of experience in software or platform engineering, including at least 3 years working on infrastructure for data or AI systems.
- Demonstrated experience building infrastructure on GCP or AWS (GCP preferred).
- Proficiency in Python, Go, or JavaScript, with a focus on developing scalable and secure backend services.
- Strong knowledge of container orchestration (e.g., Kubernetes), serverless technologies, and Infrastructure as Code tools such as Terraform.
- Hands-on experience with data processing frameworks (e.g., Spark) and workflow orchestration tools (e.g., Airflow).
- Deep understanding of CI/CD workflows, version control systems (e.g., Git), and modern monitoring/logging stacks (e.g., Prometheus, Grafana, ELK).
Skills Required
Git, Prometheus, Grafana, ELK, Python, JavaScript