Data Scientist

FPX AI · Panipat, Haryana, India
Job description

Role Overview

FPX is building an AI infrastructure marketplace that enables developers to access and deploy compute efficiently. As a PyTorch + CUDA Engineer focused on benchmarking and performance, you will be responsible for designing, running, and interpreting benchmarks across model, framework, and hardware stacks. Your role is critical in validating performance claims, detecting regressions, and guiding optimizations to ensure FPX remains at the cutting edge of compute efficiency.

You will collaborate closely with ML systems, compiler, hardware, and platform teams. You should have strong experience in PyTorch internals, GPU programming, and profiling tools.

Key Responsibilities

  • Define, build, and maintain a benchmark suite covering representative deep learning workloads (training, inference, mixed) across modalities (vision, NLP, recommendation, etc.).
  • Automate running of benchmarks across multiple hardware configurations (NVIDIA GPUs, possibly AMD, and future accelerators).
  • Use profiling, tracing, and performance tools (e.g. Nsight Systems, Nsight Compute, PyTorch Profiler, CUPTI, NVTX) to identify bottlenecks across layers (operator, kernel, memory, data movement).
  • Write and maintain scripts / harnesses that manage benchmark orchestration, result collection, and analysis (latency, throughput, memory usage, utilization metrics); an illustrative harness sketch follows this list.
  • Detect and triage performance regressions (e.g. nightly, CI-integrated benchmarks).
  • Partner with compiler / runtime / kernel teams to propose optimizations, micro-bench kernel patches, fusion, operator-level improvements, or configuration tuning.
  • Validate performance improvements across scale (multi-GPU, distributed) and in production-like settings.
  • Publish benchmark results, document methodology, and communicate trade-offs to stakeholders (engineering, product, customers).
  • Occasionally assist in custom kernel development when needed (e.g. fused kernels, optimized CUDA code) or integrating specialized libraries (Triton, CUTLASS, cuBLAS, cuDNN).
  • Stay up-to-date on new features in PyTorch (e.g. torch.compile, CUDA Graphs, new backends) and evaluate their impact.
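
Purely as an illustration of the kind of harness described above (not FPX's actual tooling), the sketch below times a placeholder model in eager mode and under torch.compile using CUDA events, then prints an operator-level breakdown from the PyTorch profiler. It assumes PyTorch 2.x on a CUDA-capable GPU; the model, tensor shapes, and iteration counts are arbitrary assumptions.

    # Illustrative benchmark harness sketch; not FPX's actual tooling.
    # Assumes PyTorch 2.x and a CUDA-capable GPU; model and shapes are placeholders.
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def benchmark(fn, inputs, warmup=10, iters=50):
        """Return mean latency per iteration in milliseconds, measured with CUDA events."""
        for _ in range(warmup):
            fn(*inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    device = "cuda"
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
    x = torch.randn(64, 4096, device=device)

    eager_ms = benchmark(model, (x,))
    compiled = torch.compile(model)            # Inductor backend by default
    compiled_ms = benchmark(compiled, (x,))
    print(f"eager: {eager_ms:.3f} ms   torch.compile: {compiled_ms:.3f} ms")

    # Operator/kernel-level breakdown via the PyTorch profiler.
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
    ) as prof:
        model(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

The same loop extends naturally to throughput and memory metrics (e.g. torch.cuda.max_memory_allocated) and to NVTX ranges for correlation with Nsight Systems traces.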

Required Qualifications

  • BS / MS / PhD in Computer Science, Electrical Engineering, or equivalent experience.
  • Solid experience (3+ years) in GPU programming: CUDA, kernel development, memory management, concurrency.
  • Deep familiarity with PyTorch internals (operators, autograd, dispatcher, JIT / Inductor pipeline or equivalent).
  • Experience with profiling and analysis of GPU workloads (Nsight, CUPTI, NVTX, PyTorch Profiler).
  • Strong Python and C++ skills.
  • Ability to analyze low-level performance (latency, throughput, memory, occupancy) and correlate to high-level model behavior.
  • Experience writing benchmark harnesses, automation, and result pipelines.
  • Excellent communication skills — able to present performance trade-offs and complex analysis to technical and non-technical audiences.

Preferred / Nice-to-Have

  • Experience with distributed training / inference (DDP, FSDP, model parallelism).
  • Experience with PyTorch’s newer compilation pathways (e.g. torch.compile, Inductor, Dynamo).
  • Knowledge of CUDA Graphs, kernel fusion, memory optimizations, and tensor core usage (see the CUDA Graphs sketch after this list).
  • Experience with other ML frameworks for baseline comparisons (TensorFlow, JAX, ONNX).
  • Published benchmarks, open-source contributions, or performance tools development.
  • Prior experience in systems, compilers, or GPU runtime development.
  • Familiarity with scaling benchmarks, cluster deployments, and heterogeneous hardware.
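
For the CUDA Graphs item above, the following minimal sketch (illustrative only, assuming static input shapes and a placeholder model) shows the capture-and-replay pattern from PyTorch's CUDA Graphs API, which removes per-kernel launch overhead for fixed-shape workloads.

    # Illustrative CUDA Graphs capture/replay sketch; static shapes and placeholder model assumed.
    import torch
    import torch.nn as nn

    device = "cuda"
    model = nn.Linear(2048, 2048).to(device)
    static_input = torch.randn(32, 2048, device=device)

    # Warm up on a side stream before capture (recommended in the PyTorch docs).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass; replaying the graph skips per-kernel launch overhead.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # New data must be copied into the captured input tensor before each replay.
    static_input.copy_(torch.randn(32, 2048, device=device))
    g.replay()
    print(static_output.shape)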

Compensation

  • Competitive salary + equity + benefits.
  • Potential for bonuses tied to performance improvements and critical benchmark delivery.