Location: Remote
Type: Full-time
Experience: 3+ years
Salary: up to 70K/month, based on experience

Role Summary

We are looking for a hands-on AI Data Engineer who can independently manage end-to-end data workflows, including data collection, document processing, dataset preparation, retrieval pipelines, model fine-tuning, and data visualization. This role requires strong technical skills across Python, automation, ML tooling, and analytical reporting.

Key Responsibilities (Technical)

1. Data Acquisition & Automation
- Build automated data collection workflows using tools such as Firecrawl, Playwright, Scrapy, or similar frameworks
- Extract multi-format documents (PDFs, HTML, text, images)
- Handle large-scale crawling, rate limits, error handling, and scheduling

2. Document Processing & Transformation
- Clean and process unstructured documents
- Apply OCR (Tesseract, PaddleOCR) to scanned files
- Convert and structure data using PyPDF2, PyMuPDF, BeautifulSoup, etc.
- Prepare data in formats such as JSON, JSONL, or CSV

3. Dataset Preparation
- Segment and structure text for ML training
- Create Q&A datasets, summaries, instruction-response pairs, and labeled text
- Build high-quality datasets compatible with fine-tuning frameworks

4. Retrieval & Indexing Pipelines
- Implement document chunking strategies
- Generate embeddings and manage vector databases (Qdrant, Pinecone, Weaviate)
- Build retrieval workflows using LangChain or LlamaIndex
- Optimize retrieval accuracy and latency

5. Model Training & Fine-Tuning
- Run fine-tuning jobs using Hugging Face Transformers, LoRA/QLoRA, or similar methods
- Monitor training performance and refine datasets
- Package and deploy fine-tuned models

6. Data Visualization & Analytics
- Create analytical charts, trends, and insights using Pandas, Matplotlib, Seaborn, and Plotly
- Build simple internal dashboards or visual summaries for reports
- Transform raw datasets into meaningful visual insights

7. Automation & Infrastructure
- Write modular, maintainable Python scripts
- Containerize workflows with Docker
- Maintain version control with Git
- Ensure reproducibility and pipeline stability

Required Technical Skills
- Strong proficiency in Python
- Experience with Firecrawl, Playwright, Scrapy, or similar tools
- Strong background in document parsing, text processing, and OCR
- Familiarity with LangChain or LlamaIndex
- Experience with vector databases
- Hands-on experience with Hugging Face, Transformer models, and fine-tuning
- Ability to write clean, efficient data pipelines
- Experience with Matplotlib, Seaborn, Plotly, or other visualization tools
- Comfort using Docker and Git

Nice to Have
- Experience serving models or building small APIs (FastAPI)
- Exposure to GPU training environments
- Background in large-scale unstructured data work
- Ability to create lightweight dashboards (Plotly Dash, Streamlit)

Ideal Candidate
- Comfortable owning full pipelines independently
- Detail-oriented and analytical
- Strong problem-solving ability
- Can work with minimal supervision
- Enjoys building structured systems from scratch
AI Data Engineer • Sangli, Maharashtra, India