Principal Machine Learning Engineer - Multimodal AI & Inference
Bangalore
Founded in 2023 by industry veterans; headquartered in California, US.
- We are revolutionizing sustainable AI compute through intuitive software and composable silicon.
Overview:
You will design, optimize, and deploy large multimodal models (language, vision, audio, video) to run efficiently on a compact, high-performance AI appliance capable of supporting 100B+ parameter models at real-time speeds. Your mission is to deliver state-of-the-art multimodal inference locally through advanced model optimization, quantization, and system-level integration.
Key Responsibilities:
1. Model Integration & Porting
- Optimize large-scale foundation models (e.g., Llama, gpt-oss, Whisper, HiDream, Qwen, Wan) for on-device inference.
- Adapt pre-trained models for multimodal tasks (text, image, audio, video, or cross-modal reasoning); a minimal loading sketch follows this list.
- Ensure seamless interoperability between modalities, e.g., enabling the system to "see, hear, and talk" naturally.
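For illustration only, here is a minimal sketch of running a pre-trained speech model locally with the Hugging Face transformers pipeline; the checkpoint and audio file names are placeholders, and real appliance work would target the optimized on-device runtime:

```python
# Minimal local speech-to-text sketch with a pre-trained Whisper checkpoint.
# The checkpoint and audio file are placeholders, not project specifics.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder; swap in the target model
)

print(asr("meeting_clip.wav")["text"])  # hypothetical local audio file
```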
2. Model Optimization for Edge Hardware
- Quantize and compress large models (4-bit or mixed precision) while maintaining high accuracy and low latency (see the sketch after this list).
- Implement and benchmark inference runtimes using frameworks such as llama.cpp, Ollama, vLLM, and ONNX Runtime.
- Collaborate with hardware engineers to co-design model architectures optimized for the appliance's compute fabric.
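As a rough illustration of this kind of work, a minimal 4-bit quantized load using transformers with bitsandbytes; the checkpoint name is a placeholder, and the appliance itself would use its own optimized runtime (e.g., llama.cpp GGUF quantization):

```python
# A 4-bit weight-quantized load via transformers + bitsandbytes. The checkpoint
# is a placeholder; the appliance would use its own runtime (e.g., GGUF Q4).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # NF4 4-bit weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # mixed-precision compute path
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

prompt = "Summarize why 4-bit inference matters for edge devices."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```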
3. Inference Pipeline Development
- Build and maintain scalable, high-throughput inference pipelines capable of handling concurrent multimodal requests (text, audio, image, video).
- Implement token streaming, caching, and scheduling strategies for real-time responses (see the streaming sketch after this list).
- Develop APIs for low-latency local inference, accessible via a web interface.
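A minimal token-streaming endpoint sketch with FastAPI; generate_tokens() is a hypothetical stand-in for the local engine's streaming generator, not an existing API:

```python
# A token-streaming endpoint sketch with FastAPI. generate_tokens() is a
# hypothetical stand-in for the local engine's streaming generator.
import asyncio
from typing import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str) -> AsyncIterator[str]:
    # Placeholder decode loop: a real implementation would pull tokens from
    # e.g. a llama.cpp or vLLM streaming API instead of echoing the prompt.
    for token in prompt.split():
        await asyncio.sleep(0.01)  # simulate per-token decode latency
        yield token + " "

@app.get("/generate")
async def generate(prompt: str) -> StreamingResponse:
    # Stream tokens as they are produced instead of waiting for the full
    # completion, which is what makes local chat feel real-time.
    return StreamingResponse(generate_tokens(prompt), media_type="text/plain")
```

Run with, e.g., uvicorn and curl the endpoint to watch tokens arrive incrementally rather than as one response.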
4. Evaluation & Benchmarking
- Profile and benchmark performance (throughput, latency, energy efficiency) of deployed models; a simple harness is sketched after this list.
- Run regression tests to validate numerical accuracy after quantization or pruning.
- Define KPIs for multimodal model performance under real-world usage.
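A minimal latency/throughput harness of the sort this role would build out; run_inference is a hypothetical callable wrapping the deployed model, and all names here are placeholders:

```python
# A minimal latency/throughput harness. run_inference is a hypothetical
# callable wrapping the deployed model; all names here are placeholders.
import statistics
import time

def benchmark(run_inference, prompts, warmup: int = 3):
    for p in prompts[:warmup]:
        run_inference(p)  # warm up caches, allocators, JIT paths
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        run_inference(p)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "throughput_req_per_s": len(prompts) / wall,
    }

# Example (placeholder names): benchmark(lambda p: model.generate(p), prompts)
```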
5. Research & Prototyping
- Investigate emerging multimodal architectures and lightweight model variants for local deployment.
- Prototype hybrid models that combine LLMs, diffusion models, and ASR/TTS pipelines for advanced multimodal applications.
- Stay current on state-of-the-art inference frameworks, compression techniques, and multimodal learning trends.

Required Qualifications:
- Strong background in deep learning and model deployment, with hands-on experience in PyTorch and/or TensorFlow.
- Expertise in model optimization: quantization, pruning, distillation, or mixed-precision inference.
- Practical knowledge of inference engines (vLLM, llama.cpp, ONNX Runtime, or similar).
- Experience deploying large models locally or on edge devices under tight memory/compute constraints.
- Familiarity with multimodal model architectures, e.g., CLIP, Flamingo, LLaVA, or AudioGPT-style systems.
- Strong software engineering skills (Python, C++, CUDA) and experience integrating models into production systems.
- Understanding of GPU/accelerator utilization, memory bandwidth optimization, and distributed inference.
- 10+ years of relevant experience.

Preferred Qualifications:
- Experience with model-parallel or tensor-parallel inference at scale.
- Contributions to open-source inference frameworks or model-serving systems.
- Familiarity with hardware-aware training or co-optimization of neural networks and hardware.
- Background in speech, vision, or multimodal ML research.
- Track record of deploying models that run entirely offline or on embedded/edge systems.

Contact: Uday
Mulya Technologies
muday_bhaskar@yahoo.com
"Mining The Knowledge Community"