SRE & DevOps Engineer (ML / AI Platform)
Contract Position | Global E-Commerce Leader | Hybrid
About the Opportunity
We're partnering with a leading global e-commerce company to find an exceptional SRE & DevOps Engineer to join their AI Platform Team. This is your chance to shape the future of machine learning infrastructure that powers innovation for millions of users worldwide.
As part of this transformative role, you'll support cutting-edge AI platforms and services, working alongside researchers, data scientists, and engineering teams in a purpose-driven, inclusive environment.
What You'll Do
Platform Operations & Support
- Support next-generation AI architecture for research and engineering teams
- Partner with vendors and infrastructure teams to ensure security and 99.999% service availability
- Diagnose and resolve production issues, including performance and functional challenges
- Provide technical support to customers and document solutions
DevOps & Automation
- Design and implement zero-downtime monitoring for highly available services
- Build CI/CD pipelines for automated deployment and configuration
- Identify automation opportunities to streamline problem management
- Develop operational standards for tools, versioning, source control, and deployment practices
Continuous Improvement
- Drive customer service enhancements and recommend product improvements
- Define engineering excellence and operational maturity standards
- Conduct customer training and generate insights reports
- Accelerate team efficiency through automation and knowledge sharing
What You Bring
Required Expertise
- 5+ years of relevant experience
- Strong Python development skills, including data structures and algorithms, with experience designing, building, and releasing production software
- Hands-on experience with ML frameworks: PyTorch, TensorFlow, Triton
- Cloud-native technologies: Kubernetes, Docker, Linux
- DevOps proficiency: CI/CD pipelines, Jenkins, test automation
- Framework troubleshooting: version upgrades, compatibility management
- Excellent debugging and triaging capabilities
Preferred Skills
- Experience with AI/ML model training and inference platforms
- LLM fine-tuning systems knowledge
- Performance monitoring and application deployment automation
#SRE #DevOps #MLOps #AI #MachineLearning #Kubernetes #Python #PyTorch #TensorFlow #CloudEngineering #Hiring #TechJobs #ContractRole