Duties and Responsibilities include, but are not limited to:
- Design and build ML pipelines for OCR extraction, document image processing, and text classification tasks.
- Fine-tune or apply prompt engineering to large language models (LLMs) such as Qwen, GPT, LLaMA, and Mistral for domain-specific use cases.
- Develop systems to extract structured data from scanned or unstructured documents (PDFs, images, TIFFs).
- Integrate OCR engines (Tesseract, EasyOCR, AWS Textract, etc.) and improve their accuracy via pre- and post-processing.
- Handle natural language processing (NLP) tasks such as named entity recognition (NER), summarization, classification, and semantic similarity.
- Collaborate with product managers, data engineers, and backend teams to productionize ML models.
- Evaluate models using metrics such as precision, recall, and F1-score, along with confusion-matrix analysis, and improve model robustness and generalizability.
- Maintain proper versioning, reproducibility, and monitoring of ML models in production.
The duties set forth above are essential job functions for the role. Reasonable accommodations may be made to enable individuals with disabilities to perform essential job functions.
Skills and Qualifications
- 4–5 years of experience in machine learning, NLP, or AI roles.
- Proficiency with Python and ML libraries such as PyTorch, TensorFlow, scikit-learn, and Hugging Face Transformers.
- Experience with LLMs (open-source or proprietary), including fine-tuning or prompt engineering.
- Solid experience with OCR tools (Tesseract, PaddleOCR, etc.) and document parsing.
- Strong background in text classification, tokenization, and vectorization techniques (TF-IDF, embeddings, etc.).
- Knowledge of handling unstructured data (text, scanned images, forms).
- Familiarity with MLOps tools: MLflow, Docker, Git, and model serving frameworks.
- Ability to write clean, modular, and production-ready code.
- Experience working with medical, legal, or financial document processing.
- Exposure to vector databases (e.g., FAISS, Pinecone, Weaviate) and semantic search.
- Understanding of document layout analysis (e.g., LayoutLM, Donut, DocTR).
- Familiarity with cloud platforms (AWS, GCP, Azure) and deploying models at scale.