Location: Remote
Type: Full-time
Experience: 3+ Years
Salary: up to 70K/month, based on experience
Role Summary

We are looking for a hands-on AI Data Engineer who can independently manage end-to-end data workflows, including data collection, document processing, dataset preparation, retrieval pipelines, model fine-tuning, and data visualization.
This role requires strong technical skills across Python, automation, ML tooling, and analytical reporting.
Key Responsibilities (Technical)

1. Data Acquisition & Automation
Build automated data collection workflows using tools such as Firecrawl, Playwright, Scrapy, or similar frameworks
Extract multi-format documents (PDFs, HTML, text, images)
Handle large-scale crawling, rate limits, error handling, and scheduling
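As a sketch of the crawling hygiene this work involves, the helper below wraps a caller-supplied `fetch` function (hypothetical; in practice a Playwright, Scrapy, or requests call) with basic rate limiting, exponential-backoff retries, and failure tracking:

```python
import time

def crawl(urls, fetch, rate_limit_s=1.0, max_retries=3):
    """Fetch each URL politely: a fixed delay between requests,
    exponential backoff on errors, and a record of failed URLs."""
    results, failures = {}, []
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)  # e.g. a Playwright/requests call
                break
            except Exception:
                time.sleep(rate_limit_s * (2 ** attempt))  # back off and retry
        else:
            failures.append(url)  # exhausted retries
        time.sleep(rate_limit_s)  # rate limit between pages
    return results, failures
```

Scheduling and large-scale orchestration would sit on top of a loop like this, not inside it.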
2. Document Processing & Transformation
Clean and process unstructured documents
Apply OCR (Tesseract, PaddleOCR) for scanned files
Convert and structure data using PyPDF2, PyMuPDF, BeautifulSoup, etc.
Prepare data in formats such as JSON, JSONL, or CSV
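A minimal illustration of the cleaning-and-export step, assuming extracted documents arrive as an id-to-text mapping (the function name and record shape are illustrative): normalize whitespace left behind by PDF or OCR extraction and write one JSON record per line:

```python
import json
import re

def to_jsonl(docs, path):
    """Normalize whitespace in extracted text (e.g. OCR or PDF output)
    and write JSONL: one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in docs.items():
            clean = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
            record = {"id": doc_id, "text": clean}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```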
3. Dataset Preparation
Segment and structure text for ML training
Create Q&A datasets, summaries, instruction-response pairs, and labeled text
Build high-quality datasets compatible with fine-tuning frameworks
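As one sketch of dataset preparation (names and record shape are illustrative, not from the posting), the helper below turns raw question-answer tuples into instruction-response records of the form many fine-tuning frameworks accept:

```python
def make_instruction_pairs(qa_items):
    """Turn raw (question, answer) tuples into instruction-response
    records, dropping empty or whitespace-only rows."""
    return [
        {"instruction": q.strip(), "input": "", "output": a.strip()}
        for q, a in qa_items
        if q.strip() and a.strip()  # basic quality filter
    ]
```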
4. Retrieval & Indexing Pipelines
Implement document chunking strategies
Generate embeddings and manage vector databases (Qdrant, Pinecone, Weaviate)
Build retrieval workflows using LangChain or LlamaIndex
Optimize retrieval accuracy and latency
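Chunking is the step most directly under the engineer's control in this pipeline; a minimal sliding-window chunker with overlap, as a sketch only, might look like:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Fixed-size sliding-window chunking with overlap, so context
    at chunk boundaries is not lost before embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice chunk boundaries would usually snap to sentence or section breaks; tuning `chunk_size` and `overlap` is part of the retrieval-accuracy work named above.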
5. Model Training & Fine-Tuning
Run fine-tuning jobs using Hugging Face Transformers, LoRA/QLoRA, or similar methods
Monitor training performance and refine datasets
Package and deploy fine-tuned models
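For orientation, a typical set of LoRA hyperparameters looks like the illustrative fragment below; the values are common defaults, not requirements of this role, and in practice they would be passed to a library such as PEFT:

```python
# Illustrative LoRA hyperparameters (common defaults, not prescribed here):
lora_config = dict(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,                     # dropout on the LoRA layers
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
```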
6. Data Visualization & Analytics
Create analytical charts, trends, and insights using:
Pandas
Matplotlib
Seaborn
Plotly
Build simple internal dashboards or visual summaries for reports
Transform raw datasets into meaningful visual insights
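A small sketch of the reporting side, assuming matplotlib is available (the function name and file path are illustrative): aggregate counts go in, a saved chart comes out, suitable for embedding in an internal report:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, works in scripts and CI
import matplotlib.pyplot as plt

def save_count_chart(counts, path="counts.png"):
    """Render a bar chart of category counts and save it to disk."""
    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("category")
    ax.set_ylabel("documents")
    fig.savefig(path)
    plt.close(fig)  # free the figure after saving
```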
7. Automation & Infrastructure
Write modular, maintainable Python scripts
Containerize workflows with Docker
Maintain version control with Git
Ensure reproducibility and pipeline stability
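Reproducibility can be checked cheaply; one illustrative approach (not prescribed by the posting) is to fingerprint datasets so that identical inputs provably produce identical pipeline runs:

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Stable SHA-256 hash of a dataset, so two pipeline runs can be
    compared: same input records -> same fingerprint."""
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```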
Required Technical Skills

Strong proficiency in Python
Experience with Firecrawl, Playwright, Scrapy, or similar tools
Strong background in document parsing, text processing, and OCR
Familiarity with LangChain or LlamaIndex
Experience with vector databases
Hands-on experience with Hugging Face, Transformer models, and fine-tuning
Ability to write clean, efficient data pipelines
Experience with Matplotlib, Seaborn, Plotly, or other visualization tools
Comfort using Docker and Git
Nice to Have

Experience serving models or building small APIs (FastAPI)
Exposure to GPU training environments
Background in large-scale unstructured data work
Ability to create lightweight dashboards (Plotly Dash, Streamlit)
Ideal Candidate

Comfortable owning full pipelines independently
Detail-oriented and analytical
Strong problem-solving ability
Can work with minimal supervision
Enjoys building structured systems from scratch
AI Data Engineer • Vizianagaram, Andhra Pradesh, India