About the Role :
We are seeking a highly experienced Voice AI / ML Engineer to lead the design and deployment of real-time voice intelligence systems. This role focuses on ASR, TTS, speaker diarization, wake word detection, and building production-grade modular audio processing pipelines to power next-generation contact center solutions, intelligent voice agents, and telecom-grade audio systems.
You will work at the intersection of deep learning, streaming infrastructure, and speech/NLP technology, creating scalable, low-latency systems across diverse audio formats and real-world applications.
Key Responsibilities :
Voice & Audio Intelligence :
- Build, fine-tune, and deploy ASR models (e.g., Whisper, wav2vec 2.0, Conformer) for real-time transcription.
- Develop and fine-tune high-quality TTS systems using VITS, Tacotron, and FastSpeech for lifelike voice generation and cloning.
- Implement speaker diarization for segmenting and identifying speakers in multi-party conversations using embeddings (x-vectors / d-vectors) and clustering (AHC, VBx, spectral clustering).
- Design robust wake word detection models with ultra-low latency and high accuracy in noisy conditions.
Real-Time Audio Streaming & Voice Agent Infrastructure :
- Architect bidirectional real-time audio streaming pipelines using WebSocket, gRPC, Twilio Media Streams, or WebRTC.
- Integrate voice AI models into live voice agent solutions, IVR automation, and AI contact center platforms.
- Optimize for latency, concurrency, and continuous audio streaming with context buffering and voice activity detection (VAD).
- Build scalable microservices to process, decode, encode, and stream audio across common codecs (e.g., PCM, Opus, μ-law, AAC, MP3) and containers (e.g., WAV, MP4).
Deep Learning & NLP Architecture :
- Utilize transformers, encoder-decoder models, GANs, VAEs, and diffusion models for speech and language tasks.
- Implement end-to-end pipelines including text normalization, G2P mapping, NLP intent extraction, and emotion/prosody control.
- Fine-tune pre-trained language models for integration with voice-based user interfaces.
Modular System Development :
- Build reusable, plug-and-play modules for ASR, TTS, diarization, codecs, streaming inference, and data augmentation.
- Design APIs and interfaces for orchestrating voice tasks across multi-stage pipelines with format conversions and buffering.
- Develop performance benchmarks and optimize for CPU/GPU, memory footprint, and real-time constraints.
Engineering & Deployment :
- Write robust, modular, and efficient Python code.
- Work with Docker, Kubernetes, and cloud deployment (AWS, Azure, GCP).
- Optimize models for real-time inference using ONNX, TorchScript, and CUDA, including quantization, context-aware inference, and model caching.
- Deploy voice models on-device.
Why join us?
- Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
- Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
- Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.
Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.