We're looking for someone who's as comfortable building ML pipelines as they are optimizing infrastructure for scale. If you thrive on solving real-world data challenges, love experimenting, and don't shy away from getting your hands dirty with deployment, this is your opportunity:
- Apply a strong understanding of machine learning principles and algorithms, with a focus on LLMs such as GPT-4, BERT, and similar architectures.
- Leverage deep learning frameworks such as TensorFlow, PyTorch, or Keras to train and fine-tune LLMs; an illustrative fine-tuning sketch follows this list.
- Utilize deep knowledge of computer architecture, especially GPUs, to maximize utilization and efficiency.
- Work with cloud platforms (AWS, Azure, GCP) to manage and optimize resources for training large-scale deep learning models.
- Use containerization and orchestration tools (Docker, Kubernetes) for scalable and reproducible ML deployments.
- Apply principles of parallel and distributed computing, including distributed training for deep learning models; a distributed-training sketch follows this list.
- Work with big data and distributed computing technologies (Hadoop, Spark) to handle large-volume datasets.
- Implement MLOps practices and use related tools to manage the complete ML lifecycle; an experiment-tracking sketch follows this list.
- Contribute to the infrastructure side of multiple ML projects, particularly those involving deep learning models such as BERT and Transformers.
- Manage resources and optimize performance for large-scale ML workloads, both on-premise and in the cloud.
- Handle challenges in training large models, including memory management, optimizing data loading, and troubleshooting hardware issues; a memory-optimization sketch follows this list.
- Collaborate closely with data scientists and ML engineers to understand infrastructure needs and deliver efficient solutions.
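To make the fine-tuning responsibility concrete, here is a minimal sketch of fine-tuning a BERT-style classifier with PyTorch. The use of the Hugging Face Transformers library, the model name, the toy data, and the hyperparameters are illustrative assumptions, not requirements from this posting.

```python
# Minimal fine-tuning sketch: BERT-style classifier with PyTorch + Hugging Face Transformers.
# Model name, toy data, and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

texts = ["great product", "terrible experience"]            # toy stand-in for a real dataset
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                                       # short illustrative loop
    optimizer.zero_grad()
    out = model(input_ids=enc["input_ids"].to(device),
                attention_mask=enc["attention_mask"].to(device),
                labels=labels.to(device))
    out.loss.backward()                                      # loss is computed when labels are passed
    optimizer.step()
```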
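Likewise, multi-GPU distributed training is commonly set up with PyTorch DistributedDataParallel along these lines; the stand-in model, synthetic dataset, and launch command are assumptions for illustration only.

```python
# Distributed data-parallel training sketch, intended to be launched with:
#   torchrun --nproc_per_node=<num_gpus> train_ddp.py
# The model and dataset are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")                  # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank)         # stand-in for a real network
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)                    # shards data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)      # faster host-to-GPU copies

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                             # reshuffle consistently across ranks
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()                  # gradients are all-reduced by DDP
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```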
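On the MLOps side, experiment tracking with MLflow (one of the tools listed under the tech stack) might look like this minimal sketch; the experiment name, parameters, and metric values are placeholders.

```python
# Experiment-tracking sketch with MLflow; names and values are illustrative.
import mlflow

mlflow.set_experiment("llm-finetuning")                      # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_param("base_model", "bert-base-uncased")
    mlflow.log_param("learning_rate", 2e-5)
    for step, loss in enumerate([0.9, 0.6, 0.4]):            # stand-in training curve
        mlflow.log_metric("train_loss", loss, step=step)
```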
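Finally, two common levers for the memory-management challenges mentioned above are mixed precision and gradient accumulation; this sketch uses PyTorch automatic mixed precision with placeholder shapes and step counts.

```python
# Memory-saving training sketch: automatic mixed precision plus gradient accumulation.
# Useful when a model does not fit comfortably at the desired batch size; values are illustrative.
import torch

model = torch.nn.Linear(512, 2).cuda()                       # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 4                                              # effective batch = 4 x micro-batch

for step in range(16):
    x = torch.randn(8, 512, device="cuda")                   # micro-batch
    y = torch.randint(0, 2, (8,), device="cuda")
    with torch.cuda.amp.autocast():                          # fp16/bf16 forward pass saves memory
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```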
Requirements:
- Strong knowledge of machine learning and deep learning algorithms, especially LLMs.
- Proficiency in Python and deep learning frameworks (TensorFlow, PyTorch, Keras).
- Expertise in GPU architecture and optimization.
- Experience with parallel and distributed computing concepts.
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes).
Tech Stack and Tools:
- Cloud: AWS, Azure, GCP.
- Big Data: Hadoop, Spark.
- MLOps Tools: MLflow, Kubeflow, or similar.
- Infrastructure Optimization: Resource allocation, distributed training, GPU performance tuning.
Nice-to-Have:
- Prior experience training large-scale deep learning models (BERT, Transformers).
- Exposure to high-scale environments and large datasets.
- Ability to troubleshoot hardware bottlenecks and optimize data pipelines.