Role – MLOps Engineer
Job Purpose :
The Staff MLOps Engineer plays a pivotal role in shaping our MLOps practice within ITG by building and enhancing a scalable, reliable, and cutting-edge Machine Learning Operations (MLOps) platform. This role combines deep cloud architecture expertise with advanced AI / ML knowledge to develop solutions that streamline workflows, enable seamless collaboration, and drive innovation.
As a key contributor to the organization’s AI / ML strategy, you will partner with cross-functional teams, including data scientists, product managers, and cloud engineers, to align platform development with business objectives. Your work will directly support the deployment of Responsible AI solutions that prioritize transparency, fairness, and ethical practices.
Knowledge, Skills, Abilities, Behaviors :
- Platform Development : Lead the enhancement of the AI platform to improve the developer experience for data and ML engineers. Optimize workflows by integrating state-of-the-art tools and technologies, ensuring scalability and efficiency.
- Cloud Infrastructure Design and Management : Architect and manage the cloud infrastructure supporting the MLOps platform, leveraging infrastructure-as-code (IaC) tools like Terraform. Optimize for scalability, security, cost-effectiveness, and high availability.
- Cross-Functional Collaboration and Stakeholder Management : Partner with data science, product management, engineering, and business teams to understand their requirements and ensure the MLOps platform supports their needs. Communicate technical concepts and strategies clearly to both technical and non-technical audiences.
- AI / ML Reliability and Observability : Collaborate with the AI / ML reliability engineering team to design and implement components that ensure the platform’s operational reliability, observability, and fault tolerance.
- Cross-Disciplinary Knowledge : Apply knowledge from related disciplines, such as data science and health / biology sciences, to design holistic MLOps solutions that meet the unique needs of the organization.
- DevOps for Machine Learning Workloads : Build and maintain robust DevOps pipelines tailored for ML workflows, enabling automated model training, testing, deployment, and monitoring.
- Tool Development and System Reliability : Design and manage tools that enhance platform reliability, including dashboards, logging systems, and alerting frameworks.
- Advanced proficiency in cloud platforms, especially Google Cloud Platform (GCP). Experience with on-premises and edge deployments is a plus.
- Solid understanding of AI / ML concepts, technologies, and best practices, with hands-on experience deploying ML solutions at scale.
- Proven ability to work closely with peer teams, data scientists, and product managers to align platform development with strategic goals.
- Proficiency in Python and other scripting languages for automation and platform optimization.
- Strong analytical and troubleshooting skills, with a track record of solving complex problems under pressure.
- Proven experience managing and leading cloud architecture and engineering teams.