Location: Remote
Type: Full-time
Experience: 3+ years
Salary: Up to 60K/month
Role Summary
We are looking for a hands-on AI Data Engineer who can independently manage end-to-end data workflows, including data collection, document processing, dataset preparation, retrieval pipelines, model fine-tuning, and data visualization.
This role requires strong technical skills across Python, automation, ML tooling, and analytical reporting.
Key Responsibilities (Technical)

1. Data Acquisition & Automation
- Build automated data collection workflows using tools such as Firecrawl, Playwright, Scrapy, or similar frameworks (a minimal sketch follows this list)
- Extract multi-format documents (PDFs, HTML, text, images)
- Handle large-scale crawling, rate limits, error handling, and scheduling
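To give a concrete flavor of this kind of work, the sketch below fetches rendered page text with Playwright's sync API. The target URL is a hypothetical placeholder, and a real crawler would add retries, rate limiting, and scheduling around this function.

```python
# Minimal Playwright collection step (sync API). The URL below is a
# hypothetical placeholder; a production crawler would wrap this in
# retry, rate-limiting, and scheduling logic.
from playwright.sync_api import sync_playwright

def fetch_page_text(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-rendered content
        text = page.inner_text("body")
        browser.close()
        return text

if __name__ == "__main__":
    print(fetch_page_text("https://example.com")[:500])
```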
2. Document Processing & Transformation
- Clean and process unstructured documents
- Apply OCR (Tesseract, PaddleOCR) to scanned files
- Convert and structure data using PyPDF2, pymupdf, BeautifulSoup, etc.
- Prepare data in formats such as JSON, JSONL, or CSV (a minimal sketch follows this list)
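As one illustration of this step, the sketch below extracts text per page with pymupdf, falls back to Tesseract OCR for scanned pages, and writes JSONL. The file paths are hypothetical, and it assumes pymupdf, pytesseract, and Pillow are installed.

```python
# PDF-to-JSONL conversion with an OCR fallback for scanned pages.
# "input.pdf" and "pages.jsonl" are hypothetical placeholder paths.
import json
import fitz  # pymupdf
import pytesseract
from PIL import Image

with fitz.open("input.pdf") as doc, open("pages.jsonl", "w", encoding="utf-8") as out:
    for i, page in enumerate(doc):
        text = page.get_text().strip()
        if not text:  # no embedded text: likely a scan, so render and OCR it
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            text = pytesseract.image_to_string(img)
        out.write(json.dumps({"page": i + 1, "text": text}) + "\n")
```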
3. Dataset Preparation
- Segment and structure text for ML training
- Create Q&A datasets, summaries, instruction-response pairs, and labeled text
- Build high-quality datasets compatible with fine-tuning frameworks (see the sketch after this list)
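A minimal example of the expected output format: instruction-response pairs serialized as JSONL, one record per line, which most fine-tuning frameworks consume directly. The records here are made up purely for illustration.

```python
# Write instruction-response pairs as JSONL (one JSON object per line).
# The example records below are hypothetical.
import json

pairs = [
    {"instruction": "Summarize the refund policy.",
     "response": "Refunds are issued within 14 days of purchase."},
    {"instruction": "What file formats are supported?",
     "response": "PDF, HTML, plain text, and common image formats."},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```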
4. Retrieval & Indexing Pipelines
- Implement document chunking strategies
- Generate embeddings and manage vector databases (Qdrant, Pinecone, Weaviate)
- Build retrieval workflows using LangChain or LlamaIndex
- Optimize retrieval accuracy and latency (a minimal sketch follows this list)
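The sketch below strings these pieces together under stated assumptions: LangChain's recursive splitter for chunking, a sentence-transformers model for embeddings, and an in-memory Qdrant collection for search. The model name, file path, and query are placeholders.

```python
# Chunk -> embed -> index -> search, using an in-memory Qdrant instance.
# Assumes langchain-text-splitters, sentence-transformers, and qdrant-client.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("document.txt", encoding="utf-8").read())

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
vectors = model.encode(chunks)

client = QdrantClient(":memory:")  # swap for a real server in production
client.create_collection("docs", vectors_config=VectorParams(size=384, distance=Distance.COSINE))
client.upsert("docs", points=[
    PointStruct(id=i, vector=v.tolist(), payload={"text": c})
    for i, (v, c) in enumerate(zip(vectors, chunks))
])

hits = client.search("docs", query_vector=model.encode("What is the refund policy?").tolist(), limit=3)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"][:80])
```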
5. Model Training & Fine-Tuning
- Run fine-tuning jobs using HuggingFace Transformers, LoRA/QLoRA, or similar methods
- Monitor training performance and refine datasets
- Package and deploy fine-tuned models (see the sketch after this list)
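As an illustrative setup, the sketch below attaches LoRA adapters to a small causal LM with HuggingFace Transformers and PEFT. The base model and target module names are assumptions that would change with the actual architecture being fine-tuned.

```python
# Minimal LoRA setup with Transformers + PEFT. The base model and
# target_modules below are placeholder assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder base model

lora = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights
```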
6. Data Visualization & Analytics
- Create analytical charts, trends, and insights using Pandas, Matplotlib, Seaborn, and Plotly
- Build simple internal dashboards or visual summaries for reports
- Transform raw datasets into meaningful visual insights (a minimal sketch follows this list)
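For example, a simple reporting chart might be produced as below with Pandas and Matplotlib; the CSV path and column names are hypothetical.

```python
# Turn a raw CSV into a weekly trend chart. "metrics.csv" and its
# column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("metrics.csv", parse_dates=["date"])
weekly = df.set_index("date")["documents_processed"].resample("W").sum()

fig, ax = plt.subplots(figsize=(8, 4))
weekly.plot(ax=ax, marker="o")
ax.set_title("Documents Processed per Week")
ax.set_xlabel("Week")
ax.set_ylabel("Documents")
fig.tight_layout()
fig.savefig("weekly_trend.png", dpi=150)
```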
7. Automation & Infrastructure
- Write modular, maintainable Python scripts
- Containerize workflows with Docker
- Maintain version control with Git
- Ensure reproducibility and pipeline stability

Required Technical Skills
- Strong proficiency in Python
- Experience with Firecrawl, Playwright, Scrapy, or similar tools
- Strong background in document parsing, text processing, and OCR
- Familiarity with LangChain or LlamaIndex
- Experience with vector databases
- Hands-on experience with HuggingFace, Transformer models, and fine-tuning
- Ability to write clean, efficient data pipelines
- Experience with Matplotlib, Seaborn, Plotly, or other visualization tools
- Comfort using Docker and Git

Nice to Have
- Experience serving models or building small APIs (FastAPI)
- Exposure to GPU training environments
- Background in large-scale unstructured data work
- Ability to create lightweight dashboards (Plotly Dash, Streamlit)

Ideal Candidate
- Comfortable owning full pipelines independently
- Detail-oriented and analytical
- Strong problem-solving ability
- Able to work with minimal supervision
- Enjoys building structured systems from scratch