Platform Engineer (Machine Learning Infrastructure)
QpiAI is a deep tech startup pioneering next-generation computing platforms. Our award-winning AI platform, QpiAI Pro, empowers enterprises to innovate and deploy AI solutions seamlessly across cloud and edge devices at scale. As AI impacts every industry, we're dedicated to making it easier to build meaningful AI-powered experiences.
As a Platform Engineer at QpiAI, you'll build and maintain our core infrastructure, shape our product roadmap, and enhance our operations. We value curiosity, continuous self-improvement, first-principles thinking, and staying up to date with the latest AI research.
Job Requirements:
- Experience in DevOps engineering, with solid hands-on skills in Docker, Kubernetes, and shell scripting.
- Experience developing and maintaining infrastructure for ML model training, model serving, ETL, etc.
- Experience working with distributed GPU systems and ML infrastructure.
- Experience with modern tools for data annotation, data curation, experiment tracking, model registries, workflow orchestration, etc.
- Experience building LLM-powered systems such as Retrieval-Augmented Generation (RAG) or agentic workflows.
- You can work your way out of an unfortunate "CUDA_ERROR_VERSION_MISMATCH" error, or better yet, prevent it altogether.
- Ability to set up multi-node Kubernetes clusters in an on-premise data center and/or deep knowledge of managed Kubernetes services; experience with Kubernetes-native tooling and its ecosystem.
- Expertise with configuration management systems like Ansible and IaC tools (Terraform / Pulumi).
- Experience with setting up and maintaining scalable data lakes and data processing pipelines.
- Ability to set up highly available services and databases.
- Expertise across cloud platforms (AWS, Azure, and GCP) and their services.
- Understanding of networking principles: load balancing, DNS configuration, proxies, etc.
- Experience building robust integration and deployment pipelines.
- Strong problem-solving, debugging, and analytical skills. Ability to plan and execute projects, delivering on time and with quality.
- Experience with ML orchestration services like Kubeflow, Flyte, Prefect, etc.
- Experience with distributed computing frameworks like Ray.
- Strong understanding of Role-Based Access Control (RBAC) principles to effectively manage permissions and access.
- Knowledge of MLOps concepts like model and data versioning, orchestration, and model serving.
- High-level understanding of modern distributed applications and their design.
Job Responsibilities:
- Take ownership of the infrastructure layer for our on-premise and cloud deployments.
- Assist development teams in designing scalable and portable applications.
- Establish best practices within the organisation and help developers ship fast without breaking things.
- Take product ownership and provide periodic progress reports to management and senior leadership.
- Be a team player who values collaboration, innovation, and inclusion.
- Mentor team members to adopt a platform-first mindset within the organisation.

We are looking for the right teammate with a shared vision and a passion to build really impactful technology. If you feel that you fit most of our requirements but not all, please don't hesitate to apply!