We are looking for a Senior Software Engineer (SDE III) who will build, profile, and optimize GPU workloads powering next-generation generative AI experiences, from Stable Diffusion image generation to transformer-based multimodal models.
You'll work closely with research and infrastructure teams to make model inference faster, more cost-efficient, and production-ready.
This role is ideal for engineers passionate about pushing GPUs to their limits, writing high-performance kernels, and turning cutting-edge research into scalable systems.
Key Responsibilities
- Develop, optimize, and maintain GPU kernels (CUDA, Triton, ROCm) for diffusion, attention, and convolution operators.
- Profile end-to-end inference pipelines (data movement, kernel scheduling, memory transfers) to identify and resolve bottlenecks.
- Apply techniques like operator fusion, tiling, caching, and mixed-precision compute to maximize GPU throughput.
- Collaborate with researchers to productionize experimental layers or model architectures.
- Build benchmarking tools and micro-tests for latency, memory, and throughput regressions.
- Integrate kernel improvements into serving stacks, ensuring reliability and repeatable performance.
- Work with platform teams to tune runtime configurations and job scheduling for GPU utilization.
Required Qualifications
- 4+ years of experience in systems or ML engineering, with 2+ years working on GPU or accelerator optimization.
- Strong hands-on skills with CUDA programming, memory hierarchies, warps, threads, and shared memory.
- Familiarity with profiling tools (Nsight, nvprof, CUPTI) and performance analysis.
- Working knowledge of PyTorch, JAX, or TensorFlow internals.
- Proficiency in C++ and Python.
- Experience with mixed precision, FP16 / BF16, or quantization.
- Deep curiosity about system bottlenecks and numerical correctness.
Skills Required
C++, Python, PyTorch, Profiling Tools