Job Description
Customer Interview
No location criteria
Key Responsibilities :
- Analyze tracing logs from LLM inference and training runs to identify performance issues and inefficiencies.
- Develop tools and scripts to parse, visualize, and monitor LLM tracing data.
- Collaborate with ML and infra teams to recommend and implement performance optimizations.
- Create documentation and dashboards to track optimization progress over time.
- Investigate and resolve model latency and throughput issues related to runtime behavior.
- Contribute to best practices for performance tracing, benchmarking, and logging across model deployments.
Required Qualifications :
- Bachelor’s or Master’s degree in Computer Science, Machine Learning, or a related field.
- Experience working with large-scale ML models, preferably LLMs (e.g., GPT, BERT).
- Proficiency in Python and common ML frameworks (e.g., PyTorch, TensorFlow).
- Familiarity with model tracing tools such as PyTorch Profiler, TensorBoard, DeepSpeed, or similar.
- Strong problem-solving skills and attention to detail in analyzing complex logs and metrics.
Preferred Qualifications :
- Experience with distributed training and inference and with GPU performance optimization.
- Knowledge of systems profiling tools (e.g., NVIDIA Nsight, perf, Flamegraphs).
- Background in MLOps, observability, or AI infrastructure.